Pippa's one-line summary: Identity flickering, texture swimming, color pulsing, and object popping are all specific artifacts caused by insufficient temporal attention. The longer the clip, the worse the drift.
Mental model: Think of temporal consistency like a relay race where each runner must wear the exact same outfit, run at the exact same speed, maintain the exact same stride, and look exactly identical to the previous runner. If any runner changes anything — slightly different shoes, a slightly different pace, a slightly different face — the audience notices immediately. That's what video models must achieve, frame after frame after frame.
What Breaks When Consistency Fails
Temporal inconsistency manifests as specific, recognizable artifacts:
- Identity flickering: A person's face subtly changes between frames — slightly different eye shape, jawline, or skin tone. Individually each frame looks fine, but in motion it creates an unsettling "shifting face" effect.
- Texture swimming: Surface details (fabric patterns, wall textures, skin pores) appear to slide across surfaces instead of staying anchored. This makes objects look like they're coated in moving liquid.
- Color pulsing: The overall color temperature or saturation oscillates frame-to-frame. The video appears to "breathe" in color.
- Object popping: Background elements appear and disappear between frames, or suddenly change shape or position.
- Shadow inconsistency: Shadows shift direction, intensity, or shape in ways that don't match the lighting or motion.
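Two of these artifacts can be roughly quantified from frame-to-frame differences. The sketch below is illustrative only: it assumes frames are small grayscale grids (nested lists of floats), and the function names `color_pulsing_score` and `pixel_flicker_score` are hypothetical, not part of any library. It also cannot tell real motion apart from flicker, so it is only meaningful on shots that should be static.

```python
from statistics import mean

def frame_means(frames):
    """Mean intensity of each frame; a frame is a 2D list of grayscale values."""
    return [mean(v for row in frame for v in row) for frame in frames]

def color_pulsing_score(frames):
    """Average absolute change in whole-frame mean between consecutive frames."""
    means = frame_means(frames)
    return mean(abs(b - a) for a, b in zip(means, means[1:]))

def pixel_flicker_score(frames):
    """Average absolute per-pixel change between consecutive frames."""
    diffs = []
    for prev, curr in zip(frames, frames[1:]):
        for row_p, row_c in zip(prev, curr):
            diffs.extend(abs(c - p) for p, c in zip(row_p, row_c))
    return mean(diffs)

# A perfectly stable clip: four identical 2x2 frames.
stable = [[[0.5, 0.5], [0.5, 0.5]] for _ in range(4)]

# Color pulsing: overall brightness alternates between 0.5 and 0.6.
pulsing = [[[0.5 + 0.1 * (t % 2)] * 2 for _ in range(2)] for t in range(4)]

# Texture swimming (toy version): the pattern slides sideways every frame,
# so the frame mean never changes but every pixel does.
swimming = [
    [[0, 1], [0, 1]] if t % 2 == 0 else [[1, 0], [1, 0]]
    for t in range(4)
]

for name, clip in [("stable", stable), ("pulsing", pulsing), ("swimming", swimming)]:
    print(name, color_pulsing_score(clip), pixel_flicker_score(clip))
```

The swimming clip scores zero on the global color metric but high on the per-pixel metric, which is why a single number rarely captures every kind of inconsistency.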
How Models Achieve Temporal Consistency
Modern video models use architectural features specifically designed for temporal coherence:
- Temporal attention layers: Spatial attention lets each position "see" other positions in the same frame. Temporal attention adds a second pass in which each position "sees" the same location in other frames (only nearby frames or the whole clip, depending on the model). This creates explicit dependencies across time.
- 3D convolutions: Standard image models use 2D convolutions (sliding a filter across width and height). Video models use 3D convolutions that slide across width, height, AND time, naturally enforcing local temporal smoothness.
- Diffusion Transformer (DiT) architectures: Used by Veo 3, Sora, and others, these process entire video sequences as token sequences, allowing global attention across all frames simultaneously.
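The first two mechanisms can be illustrated with a toy sketch. This is a simplification under stated assumptions, not how any particular model implements them: `temporal_attention` uses the features themselves as queries, keys, and values (real layers apply learned projections), and `temporal_conv` uses a fixed smoothing kernel along time only (real 3D convolutions learn their kernels and also span width and height). Both function names are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def temporal_attention(track):
    """Self-attention over time for ONE spatial location.

    track: list of T feature vectors, one per frame, all taken from the
    same position. Each output is a weighted average of the features at
    that position across every frame.
    """
    dim = len(track[0])
    out = []
    for query in track:
        weights = softmax([dot(query, key) / math.sqrt(dim) for key in track])
        out.append([sum(w * v[d] for w, v in zip(weights, track)) for d in range(dim)])
    return out

def temporal_conv(values, kernel=(0.25, 0.5, 0.25)):
    """1D convolution along the time axis with edge padding."""
    pad = len(kernel) // 2
    padded = [values[0]] * pad + list(values) + [values[-1]] * pad
    return [
        sum(k * padded[t + i] for i, k in enumerate(kernel))
        for t in range(len(values))
    ]

# One location tracked over four frames; the second feature flickers.
track = [[1.0, 0.0], [1.0, 0.2], [1.0, -0.2], [1.0, 0.0]]
smoothed = temporal_attention(track)
spread_before = max(v[1] for v in track) - min(v[1] for v in track)
spread_after = max(v[1] for v in smoothed) - min(v[1] for v in smoothed)
print(spread_before, spread_after)  # the spread shrinks after attention

# A value that alternates every frame is damped by the temporal kernel.
print(temporal_conv([0, 1, 0, 1, 0]))  # [0.25, 0.5, 0.5, 0.5, 0.25]
```

In both cases each output frame is built from several input frames, which is what ties neighboring frames together and suppresses frame-to-frame jitter.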
Consistency vs. Clip Length
An important practical truth: consistency degrades with clip length. On a typical model, a 2-second clip can maintain near-perfect consistency, a 5-second clip starts showing subtle drift, and a 10-second clip often has visible inconsistencies. As a rough guide, current models (2026) handle:
- 1–4 seconds: Very high consistency. Most models do this well.
- 5–15 seconds: Good consistency with top models (Runway Gen-4, Veo 3). Some drift possible.
- 15–60 seconds: Possible with the best models, but requires careful prompting and often shows some artifacts.
- 60+ seconds: Still challenging. Most workflows stitch shorter clips together rather than generating long continuous takes.
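The stitching approach for long takes can be planned with simple arithmetic. The sketch below is a minimal illustration, assuming a hypothetical helper `plan_segments`; the default 10-second clip limit and 1-second overlap are placeholder values chosen for the example, not recommendations from any model vendor.

```python
def plan_segments(total_seconds, max_clip_seconds=10.0, overlap_seconds=1.0):
    """Split a target duration into (start, end) clips.

    Each clip is at most max_clip_seconds long, and consecutive clips
    share overlap_seconds so they can be cross-faded or matched at the seam.
    """
    if total_seconds <= 0:
        raise ValueError("total_seconds must be positive")
    if max_clip_seconds <= overlap_seconds:
        raise ValueError("max_clip_seconds must exceed overlap_seconds")

    step = max_clip_seconds - overlap_seconds
    segments = []
    start = 0.0
    while True:
        end = min(start + max_clip_seconds, float(total_seconds))
        segments.append((start, end))
        if end >= total_seconds:
            break
        start += step
    return segments

# A 28-second sequence built from clips of at most 10 seconds each.
for start, end in plan_segments(28):
    print(f"generate clip covering {start:.1f}s to {end:.1f}s")
```

Keeping every generated clip inside the range where a model is most reliable trades one long, drift-prone take for several short ones plus the extra work of matching the seams.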
Key Takeaways
- Temporal consistency means every visual property stays stable across frames.
- Failures manifest as flickering faces, swimming textures, color pulsing, and object popping.
- Models achieve consistency through temporal attention, 3D convolutions, and DiT architectures.
- Consistency degrades with clip length — shorter clips are more reliable.