Pippa's one-line summary: Video isn't "image × N"; it's a *temporal coherence* problem. Spatial × temporal × motion × persistence: the difficulty compounds multiplicatively.
Mental model: Imagine you're making a flipbook. If you draw each page independently — without looking at the previous page — you'd get 30 beautiful individual drawings, but when you flip through them, the character would teleport, change size, shift color, and jitter wildly. That's what happens if you just run an image generator 30 times. Video requires each frame to be aware of all the other frames. That's an entirely different computational problem.
The Dimension Explosion
An image is a 2D grid of pixels. A video is a 3D tensor: width × height × time. Adding the time dimension doesn't just add more data — it adds relationships that must be maintained:
Image: Width × Height = ~1 million pixels to get right
Video: Width × Height × Frames = ~30 million pixels to get right (at 30 frames)
PLUS every pixel must be consistent with its neighbors in BOTH space AND time
┌──────────────────────────────────────────────────────┐
│  Frame 1     Frame 2     Frame 3    ...     Frame N  │
│  ┌─────┐     ┌─────┐     ┌─────┐            ┌─────┐  │
│  │     │──→  │     │──→  │     │──→ ... ──→ │     │  │
│  │     │     │     │     │     │            │     │  │
│  └─────┘     └─────┘     └─────┘            └─────┘  │
│  spatial     spatial+    spatial+           spatial+ │
│  only        temporal    temporal           temporal │
└──────────────────────────────────────────────────────┘
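To put rough numbers on the dimension explosion, here is a minimal sketch that counts pixels and adjacent-pixel relationships for one image versus a short clip. The 1024 × 1024 resolution, the 30-frame length, and the `grid_stats` helper are illustrative assumptions, not figures from any particular model. Counting only immediate neighbors is also a simplification: real consistency constraints reach much further than the next pixel or the next frame.

```python
def grid_stats(width: int, height: int, frames: int = 1) -> dict:
    """Count pixels and adjacent-pixel pairs in a width x height x frames grid."""
    pixels = width * height * frames
    # Spatial pairs: horizontally and vertically adjacent pixels inside each frame.
    spatial_pairs = frames * ((width - 1) * height + width * (height - 1))
    # Temporal pairs: the same pixel position in two consecutive frames.
    temporal_pairs = width * height * (frames - 1)
    return {
        "pixels": pixels,
        "spatial_pairs": spatial_pairs,
        "temporal_pairs": temporal_pairs,
    }


# Assumed sizes: a ~1-megapixel image and a 30-frame clip at the same resolution.
image = grid_stats(1024, 1024)
video = grid_stats(1024, 1024, frames=30)

print("image:", image)
print("video:", video)
```

The single image has no temporal pairs at all. The clip adds roughly one more relationship per pixel per frame on top of 30 times the spatial ones, which is the sense in which time adds relationships and not just data.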
What "Consistency Across Time" Means
For a video to look right, the model must keep all of the following consistent from frame to frame:
- Identity: A person must look like the same person in every frame — same face, proportions, skin tone.
- Motion: Movement must follow physics, with smooth acceleration, natural arcs, and no sudden jumps in speed.
- Lighting: Shadows, highlights, and color temperature must evolve naturally with camera/subject movement.
- Environment: The background, props, and scene must stay stable unless explicitly moving.
- Clothing: Garments must move with the body, not spontaneously change color or shape.
- Physics: Hair flows, fabric drapes, water splashes, smoke dissipates — all must obey physical intuition.
A single image only needs to be internally consistent at one moment. A video needs to be internally consistent at every moment AND consistent between all moments. The constraint space explodes.
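The flipbook intuition can be made concrete with a toy measurement. The sketch below is an assumption-laden illustration, not how any video model works: it reduces each "frame" to a single number (a character's horizontal position) and compares frames drawn independently with frames that each build on the previous one. The `mean_step` helper, the 0 to 100 position range, and the step size of 2.0 are all made up for the example.

```python
import random


def mean_step(positions):
    """Average absolute change in position between consecutive frames."""
    steps = [abs(b - a) for a, b in zip(positions, positions[1:])]
    return sum(steps) / len(steps)


rng = random.Random(0)  # fixed seed so repeated runs give the same numbers
n_frames = 30

# Independent frames: each position is drawn without looking at the previous one.
independent = [rng.uniform(0.0, 100.0) for _ in range(n_frames)]

# Coherent frames: each position is the previous one plus a small, similar step.
coherent = [10.0]
for _ in range(n_frames - 1):
    coherent.append(coherent[-1] + 2.0 + rng.uniform(-0.5, 0.5))

print(f"independent frames, mean step: {mean_step(independent):.1f}")
print(f"coherent frames, mean step:    {mean_step(coherent):.1f}")
```

Every individual position in the independent list is perfectly valid on its own. The large frame-to-frame step is what would show up as teleporting and jitter when the frames are played in sequence.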
The Progress Is Real
Despite the difficulty, video generation has advanced dramatically. As of 2026, models like Runway Gen-4, Veo 3, and Kling 3.0 produce clips that would have been considered impossible two years ago: stable character identity, coherent motion, even synchronized audio. The field evolved from "impressive but unreliable" in 2024 to "production-grade for short clips" in 2026. But understanding why it's hard helps you work within its limits effectively.
- Video is not "image generation repeated" — it requires temporal coherence across all frames.
- The time dimension adds constraints: identity, motion, lighting, environment, and physics must all stay consistent.
- This transforms the problem from 2D consistency to 3D consistency (space + time).
- Despite the difficulty, 2025–2026 models have reached production quality for short clips.