C.W.K.
Lesson 01 of 10 · published

Video Is NOT Just Image Generation Repeated

~12 min · video, temporal, l1

Pippa's one-line summary: Video isn't 'image × N'; it's a *temporal coherence* problem. Spatial × temporal × motion × persistence: the difficulty is multiplicative.

Mental model: Imagine you're making a flipbook. If you draw each page independently — without looking at the previous page — you'd get 30 beautiful individual drawings, but when you flip through them, the character would teleport, change size, shift color, and jitter wildly. That's what happens if you just run an image generator 30 times. Video requires each frame to be aware of all the other frames. That's an entirely different computational problem.
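To make the flipbook intuition concrete, here is a toy sketch, not a real video model. It assumes a "frame" is just the character's x-position on a 0 to 100 strip, and compares drawing each page independently against drawing each page from the previous one. The function names and the `max_step` parameter are made up for illustration.

```python
import random

def independent_frames(n_frames, rng):
    """Each frame picks a position with no memory of the previous frame."""
    return [rng.uniform(0.0, 100.0) for _ in range(n_frames)]

def coherent_frames(n_frames, rng, max_step=2.0):
    """Each frame starts from the previous position and moves only a little."""
    x = rng.uniform(0.0, 100.0)
    positions = [x]
    for _ in range(n_frames - 1):
        # Small step from the last position, clamped to the 0..100 strip.
        x = min(100.0, max(0.0, x + rng.uniform(-max_step, max_step)))
        positions.append(x)
    return positions

def mean_jump(positions):
    """Average frame-to-frame displacement: a crude 'jitter' score."""
    jumps = [abs(b - a) for a, b in zip(positions, positions[1:])]
    return sum(jumps) / len(jumps)

rng = random.Random(0)  # fixed seed so the run is repeatable
jitter_independent = mean_jump(independent_frames(30, rng))
jitter_coherent = mean_jump(coherent_frames(30, rng))
print(f"independent: {jitter_independent:.1f}  coherent: {jitter_coherent:.1f}")
```

The independent flipbook teleports between pages, while the coherent one can never move more than `max_step` per page. Real models have to get the same effect for every pixel at once, not for one number.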

The Dimension Explosion

An image is a 2D grid of pixels. A video is a 3D tensor: width × height × time (setting color channels aside). Adding the time dimension doesn't just add more data; it adds relationships that must be maintained:

  Image:  Width × Height = ~1 million pixels to get right

  Video:  Width × Height × Frames = ~30 million pixels to get right (30-frame clip)
          PLUS every pixel must be consistent with its neighbors
          in BOTH space AND time

  ┌────────────────────────────────────────────────┐
  │  Frame 1    Frame 2    Frame 3    ...  Frame N │
  │  ┌─────┐    ┌─────┐    ┌─────┐         ┌─────┐ │
  │  │     │───→│     │───→│     │──→ ... →│     │ │
  │  │     │    │     │    │     │         │     │ │
  │  └─────┘    └─────┘    └─────┘         └─────┘ │
  │  spatial    spatial+   spatial+        spatial+│
  │  only       temporal   temporal        temporal│
  └────────────────────────────────────────────────┘
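The arithmetic behind those numbers can be sketched in a few lines. The resolution (1024 × 1024) and clip length (30 frames) are assumptions chosen to match the "~1 million" and "~30 million" figures; the neighbor-pair counts are one simple way to count "relationships", not a measure any particular model uses.

```python
# Illustrative numbers only: a 1024 x 1024 frame and a 30-frame clip.
width, height, frames = 1024, 1024, 30

image_pixels = width * height
video_pixels = width * height * frames

# Pairs of pixels that sit next to each other and so must agree:
# left-right and up-down neighbors within one frame...
spatial_pairs_per_frame = (width - 1) * height + width * (height - 1)
# ...plus each pixel and the same pixel position in the next frame.
temporal_pairs = width * height * (frames - 1)

print(f"image pixels:            {image_pixels:,}")
print(f"video pixels:            {video_pixels:,}")
print(f"spatial pairs (1 frame): {spatial_pairs_per_frame:,}")
print(f"spatial pairs (video):   {spatial_pairs_per_frame * frames:,}")
print(f"temporal pairs (video):  {temporal_pairs:,}")
```

A single image has no temporal pairs at all. In the video, the temporal pairs alone are about as numerous as the pixels themselves, and they exist only because of the time axis.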

What "Consistency Across Time" Means

For a video to look right, the model must maintain consistency across every single frame for:

  • Identity: A person must look like the same person in every frame — same face, proportions, skin tone.
  • Motion: Movement must follow physics — smooth acceleration, natural arcs, consistent speed.
  • Lighting: Shadows, highlights, and color temperature must evolve naturally with camera/subject movement.
  • Environment: The background, props, and scene must stay stable unless explicitly moving.
  • Clothing: Garments must move with the body, not spontaneously change color or shape.
  • Physics: Hair flows, fabric drapes, water splashes, smoke dissipates — all must obey physical intuition.

A single image only needs to be internally consistent at one moment. A video needs to be internally consistent at every moment AND consistent between all moments. The constraint space explodes.
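One rough way to see why the constraint space "explodes" is to count checks. This sketch assumes a deliberately simplified model: one internal-consistency check per frame, plus one check for every pair of frames. The helper name `consistency_checks` is hypothetical.

```python
from math import comb

def consistency_checks(n_frames):
    """Return (within-frame checks, between-frame checks) for a clip."""
    within = n_frames            # each frame must be consistent with itself
    between = comb(n_frames, 2)  # every pair of frames must agree
    return within, between

for n in (1, 30, 120):
    within, between = consistency_checks(n)
    print(f"{n:>3} frames: {within:>3} within + {between:>4} between = {within + between}")
```

Under this counting, the within-frame work grows linearly with clip length while the between-frame work grows quadratically, which is why longer clips get harder faster than their frame count suggests.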

The Progress Is Real

Despite the difficulty, video generation has advanced dramatically. As of 2026, models like Runway Gen-4, Veo 3, and Kling 3.0 produce clips that would have been considered impossible two years ago: stable character identity, coherent motion, even synchronized audio. The field evolved from "impressive but unreliable" in 2024 to "production-grade for short clips" in 2026. But understanding why it's hard helps you work within its limits effectively.

Key Takeaways
  • Video is not "image generation repeated" — it requires temporal coherence across all frames.
  • The time dimension adds constraints: identity, motion, lighting, environment, clothing, and physics must all stay consistent.
  • This transforms the problem from 2D consistency to 3D consistency (space + time).
  • Despite the difficulty, 2025–2026 models have reached production quality for short clips.

Exercise

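Pick a 5-second video you like. At 24 fps that is 120 frames; storyboard it as 30 still frames (every 4th frame). For each still, imagine whether it is one of the model's 'free moments' (static camera, environmental motion) or one of its 'hard moments' (subject articulation, complex physics). What you end up with is a video model's difficulty profile.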
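If you want exact frames for the storyboard, a small helper can list which ones to grab. This is a sketch that assumes evenly spaced stills starting at frame 0; the variable names are made up for illustration.

```python
fps = 24         # frames per second of the source clip
duration_s = 5   # clip length in seconds
n_stills = 30    # how many storyboard frames you want

total_frames = fps * duration_s   # 24 * 5 = 120
step = total_frames // n_stills   # take every 4th frame

# Frame indices to capture, and when each one occurs in the clip.
still_indices = list(range(0, total_frames, step))[:n_stills]
timestamps = [index / fps for index in still_indices]

for index, t in zip(still_indices, timestamps):
    print(f"frame {index:>3} at {t:.2f}s")
```

Scrub to each timestamp in your video player, grab the still, and label it 'free' or 'hard'.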
