Pippa's one-line summary: Background objects disappear or shift position because the attention budget concentrates on the main subject. Keep clips short, backgrounds simple, and the camera static.
Mental model: Anybody can take one great photo. Filming a 30-second commercial in which every frame looks professional, the lighting stays perfect, the product stays centered, and the model's hair doesn't shift is a whole different skill. The gap between "one beautiful frame" and "five seconds of beautiful frames" is enormous, and it is where most of video generation's struggles become visible.
Why Good Frames ≠ Good Video
Consider what must hold true across even a short 3-second clip at 24fps (72 frames):
- Every frame must be individually high quality (no artifacts, good composition)
- Adjacent frames must flow smoothly (no jumps, no flicker)
- Distant frames must maintain identity (frame 1 and frame 72 show the same person, same clothes, same environment)
- Motion must be physically plausible throughout
- Scene elements must persist (a vase on the table in frame 1 must still be there in frame 72, in the same position if nobody moved it)
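To put rough numbers on the list above, the following sketch counts how many frames and frame-to-frame relationships a clip contains. The function name `frame_counts` is made up for illustration, and counting pairs is only a crude proxy for the amount of consistency the model has to maintain.

```python
from math import comb

def frame_counts(seconds, fps):
    """Return (frames, adjacent_pairs, all_pairs) for a clip.

    adjacent_pairs: neighboring frames that must flow smoothly.
    all_pairs: every pair of frames that must agree on identity,
    including distant ones such as the first and last frame.
    """
    frames = round(seconds * fps)
    adjacent_pairs = max(frames - 1, 0)
    all_pairs = comb(frames, 2)
    return frames, adjacent_pairs, all_pairs

short_clip = frame_counts(3, 24)
longer_clip = frame_counts(5, 24)
print(short_clip)   # (72, 71, 2556)
print(longer_clip)  # (120, 119, 7140)
```

Going from 3 to 5 seconds adds 48 frames but nearly triples the number of frame pairs that have to stay consistent with each other.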
A single image only needs to satisfy internal spatial consistency. A video must satisfy spatial consistency × temporal consistency × motion coherence × object persistence × environmental stability. Each additional requirement is multiplicative, not additive.
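One way to see why multiplicative matters is a back-of-the-envelope calculation. The sketch below assumes each requirement succeeds independently with a made-up 90% rate; both the independence and the numbers are illustrative assumptions, not measurements of any model.

```python
from math import prod

# Hypothetical per-requirement success rates (illustrative numbers only).
requirements = {
    "spatial_consistency": 0.9,
    "temporal_consistency": 0.9,
    "motion_coherence": 0.9,
    "object_persistence": 0.9,
    "environmental_stability": 0.9,
}

def joint_success_rate(rates):
    """Chance that every requirement holds, assuming they are independent."""
    return prod(rates.values())

image_rate = requirements["spatial_consistency"]  # a still image needs only this
video_rate = joint_success_rate(requirements)     # a clip needs all five at once

print(f"single image: {image_rate:.2f}")  # single image: 0.90
print(f"full clip:    {video_rate:.2f}")  # full clip:    0.59
```

Under these assumptions, five requirements that each look reliable on their own combine into a clip that succeeds only about 59% of the time.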
Object Persistence
One of the most noticeable continuity failures is also one of the subtlest to describe: a loss of object persistence. In a real video, a coffee cup on a desk stays there unless someone moves it. In AI video, the cup might:
- Gradually fade or blur away
- Shift position slightly between frames
- Change shape or color
- Disappear entirely when the camera looks away and reappear looking different when it returns
This happens because the model typically doesn't maintain an explicit internal 3D model of the scene. It generates frames from learned patterns, and small background elements receive only weak attention, so their appearance and position can drift over time.
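The failure modes above can be checked for after generation. The sketch below assumes you already have a per-frame track of a static object's center from some object tracker, with `None` for frames where it was not found. The function name `persistence_report`, the 2-pixel tolerance, and the sample track are all hypothetical.

```python
from math import hypot

def persistence_report(track, tolerance_px=2.0):
    """Summarize how well one static object persists across a clip.

    track: one entry per frame, either an (x, y) center in pixels or
    None when the object was not detected in that frame.
    """
    missing_frames = [i for i, p in enumerate(track) if p is None]
    seen = [p for p in track if p is not None]
    if not seen:
        return {"missing_frames": missing_frames,
                "max_drift_px": None,
                "persistent": False}
    # Measure drift relative to where the object first appeared.
    x0, y0 = seen[0]
    max_drift_px = max(hypot(x - x0, y - y0) for x, y in seen)
    persistent = not missing_frames and max_drift_px <= tolerance_px
    return {"missing_frames": missing_frames,
            "max_drift_px": max_drift_px,
            "persistent": persistent}

# Made-up track of a coffee cup that shifts, vanishes, then reappears elsewhere.
cup_track = [(100, 50), (100, 50), (101, 50), (103, 54), None, (110, 58)]
report = persistence_report(cup_track)
print(report)
```

This check only covers position and presence. Catching changes in shape or color would need a comparison of the object's appearance as well.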
The Attention Budget
A useful way to think about continuity is as an "attention budget." The model has a finite amount of attention to distribute across the scene. The main subject gets most of it. Secondary elements (background, props, environmental details) get less. The further from the center of attention, the more likely an element is to drift, change, or disappear.
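The budget analogy can be made concrete with a toy calculation. The sketch below turns made-up salience scores into shares that sum to 1 using a softmax. It illustrates the metaphor only and is not a description of how any particular video model allocates attention.

```python
from math import exp

def attention_shares(salience):
    """Split a fixed budget of 1.0 across scene elements with a softmax."""
    top = max(salience.values())
    # Subtract the maximum before exponentiating for numerical stability.
    weights = {name: exp(score - top) for name, score in salience.items()}
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}

# Hypothetical salience scores: higher means closer to the center of attention.
simple_scene = {"main_subject": 3.0, "coffee_cup": 1.0}
busy_scene = {"main_subject": 3.0, "coffee_cup": 1.0,
              "bookshelf": 1.0, "window": 0.5, "plant": 0.5}

simple_shares = attention_shares(simple_scene)
busy_shares = attention_shares(busy_scene)

for name, share in busy_shares.items():
    print(f"{name:>12}: {share:.2f}")
```

In this toy model, the coffee cup's share is smaller in the busy scene than in the simple one, which is one way to read the advice to keep backgrounds simple.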
Key Takeaways
- A good single frame is much easier to generate than a good 5-second clip.
- Continuity requires spatial consistency × temporal consistency × motion × persistence — multiplicative difficulty.
- Background elements are weakly constrained and prone to drift, change, or disappear.
- Think of the model's attention as a budget — the main subject gets most of it, everything else gets less.