Pippa's one-line summary: Lip sync has a tolerance of roughly 100ms. Beyond that, the mismatch becomes noticeable and soon distracting. Three approaches: native sync (Veo 3), post-sync, and audio-driven video.
Mental model: Picture a dubbed foreign film where the audio doesn't match the lip movements. Even if the voice acting is excellent, the mismatch is deeply distracting: your brain flags that something is wrong before you consciously notice it. Humans are extraordinarily sensitive to audio-visual synchronization. We read lips, match sounds to mouth shapes, and register misalignment almost immediately. Any media that combines voice and face must get sync right, or nothing else matters.
Why Sync Matters Psychologically
Research on audio-visual perception indicates that misalignment as small as roughly 100 milliseconds is perceptible to most viewers. At 200ms, it becomes distracting. At 500ms, it feels obviously broken. This is a tighter tolerance than most other aspects of video quality: you'll forgive slightly wrong lighting or imperfect shadows long before you'll forgive bad lip sync.
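To make the thresholds above concrete, here is a minimal sketch that labels a measured offset and converts it to video frames. The function names and the label for offsets under 100ms are illustrative assumptions, and the sketch treats audio-early and audio-late offsets the same, which is a simplification.

```python
def classify_av_offset(offset_ms: float) -> str:
    """Label an audio-visual offset using the 100/200/500 ms figures quoted above.

    Assumption: the sign of the offset (audio early vs. late) is ignored.
    """
    magnitude = abs(offset_ms)
    if magnitude < 100:
        return "usually unnoticed"   # assumed label; the text gives no figure below 100 ms
    if magnitude < 200:
        return "perceptible"
    if magnitude < 500:
        return "distracting"
    return "obviously broken"


def offset_in_frames(offset_ms: float, fps: float) -> float:
    """Express an offset in video frames at the given frame rate."""
    return offset_ms * fps / 1000.0
```

At 24 frames per second, a 100ms offset is only about 2.4 frames, which shows how little slack a pipeline has before viewers start to notice.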
This matters for generative media because combining AI-generated video with AI-generated voice requires precise synchronization that was historically handled by separate tools, with sync as an afterthought.
Three Approaches to Lip Sync
1. Native Synchronized Generation
The model generates video and audio simultaneously, with built-in synchronization. Veo 3 pioneered this approach in 2025, producing video where character lip movements match generated dialogue natively.
- Pros: Sync is built-in, no post-production alignment needed, most natural result.
- Cons: Less control over voice and video independently, limited to models that support it.
2. Post-Generated Lip Sync
Generate video and audio separately, then use a lip-sync tool to match them. Tools analyze the audio, extract phoneme timings, and warp the character's mouth to match.
- Pros: Maximum flexibility — use any video model with any voice model. Can retrofit lip sync onto existing content.
- Cons: Can look artificial if the warping is too aggressive, requires an additional processing step.
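The phoneme-timing step described above can be sketched as follows. This is a simplified illustration, not how any particular lip-sync tool works: the `PhonemeSpan` type, the tiny phoneme-to-viseme table, and the frame-center sampling rule are all assumptions, and real tools go on to warp or re-render the mouth region from targets like these.

```python
from dataclasses import dataclass


@dataclass
class PhonemeSpan:
    phoneme: str    # phoneme label, e.g. "M" or "AA"
    start_s: float  # start time in seconds
    end_s: float    # end time in seconds (exclusive)


# Hypothetical, heavily simplified phoneme-to-mouth-shape (viseme) table.
VISEME_FOR_PHONEME = {
    "AA": "open", "IY": "wide", "UW": "round",
    "M": "closed", "B": "closed", "P": "closed",
    "F": "teeth_on_lip", "V": "teeth_on_lip",
}


def visemes_per_frame(spans, fps, n_frames, default="rest"):
    """Pick one target mouth shape per video frame from phoneme timings."""
    frames = []
    for i in range(n_frames):
        t = (i + 0.5) / fps  # sample at the center of each frame
        viseme = default     # frames outside any span keep the resting shape
        for span in spans:
            if span.start_s <= t < span.end_s:
                viseme = VISEME_FOR_PHONEME.get(span.phoneme, default)
                break
        frames.append(viseme)
    return frames
```

For example, `visemes_per_frame([PhonemeSpan("M", 0.0, 0.2), PhonemeSpan("AA", 0.2, 0.5)], fps=10, n_frames=6)` yields two closed-mouth frames, three open-mouth frames, and one resting frame.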
3. Audio-Driven Video Generation
Provide the audio first, then generate video that matches it. The model receives voice audio and an image/video reference, producing a talking-head video synchronized to the speech.
- Pros: Voice quality is controlled independently. Video generation anchors to the audio timing.
- Cons: Limited to talking-head scenarios. Complex scenes with talking characters are harder.
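One way to picture "video generation anchors to the audio timing" is that the audio length fixes how many frames must be generated. The sketch below reads a WAV file's duration with Python's standard `wave` module and derives a frame count. The helper names are hypothetical, and an actual audio-driven model would handle this internally.

```python
import math
import wave


def wav_duration_seconds(source) -> float:
    """Return the duration of a WAV file given a path or a binary file object."""
    with wave.open(source, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def frames_needed(duration_s: float, fps: float) -> int:
    """Number of video frames required to cover the whole audio clip."""
    # The small epsilon guards against float error pushing an exact
    # result (e.g. 50.0) up to the next integer.
    return math.ceil(duration_s * fps - 1e-9)
```

A two-second voice clip rendered at 25 frames per second needs 50 frames. Rounding up ensures the video never ends before the speech does.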
Approach 1: Native        Approach 2: Post-Sync     Approach 3: Audio-Driven
┌────────────┐            ┌────────────┐            ┌────────────┐
│   Single   │            │ Video Gen  │            │ Voice Gen  │
│   Prompt   │            │ (separate) │            │  (first)   │
└─────┬──────┘            └─────┬──────┘            └─────┬──────┘
      │                         │                         │
      ▼                   ┌─────┴──────┐                  ▼
┌────────────┐            │ Voice Gen  │            ┌────────────┐
│  Video +   │            │ (separate) │            │ Video Gen  │
│   Audio    │            └─────┬──────┘            │ (driven by │
│  (synced)  │                  │                   │   audio)   │
└────────────┘            ┌─────┴──────┐            └────────────┘
                          │  Lip-Sync  │
                          │    Tool    │
                          └────────────┘
- Humans detect audio-visual misalignment at ~100ms — sync tolerance is extremely tight.
- Three approaches: native generation (Veo 3), post-sync tools, and audio-driven video.
- Native sync is most natural but less controllable. Post-sync is most flexible but requires extra steps.
- For talking-head content, generate voice first, then video — this gives the most control.