Pippa's one-line summary: Rubber body, melting, drift, teleportation, physics violations. The model knows what walking *looks like*, but it doesn't know the joint constraints behind it. The simpler the motion, the safer.
Mental model: Imagine asking someone to animate a person walking by drawing 30 frames — but they've never studied animation and only have 2 seconds per frame. Early frames might look great, but as fatigue sets in, the legs start to stretch, the arms drift to the wrong rhythm, the head bobs unnaturally, and suddenly the person looks like they're made of rubber. That's what happens when a video model struggles with motion coherence — it knows what movement looks like in aggregate, but it can't maintain the physical rules that govern smooth, natural motion.
Common Motion Failures
- Rubber body effect: Limbs stretch and compress unnaturally, as if the character is made of elastic. This happens because the model interpolates between key poses without understanding skeletal constraints.
- Object melting: Solid objects gradually lose their shape, edges becoming soft and boundaries merging with the background. The model's attention to the object weakens over frames.
- Drift: Objects slowly slide in a direction no one asked for. A person standing still gradually drifts to the left. A vase on a table slowly migrates toward the edge. The model's spatial anchor loosens over time.
- Teleportation snaps: Between one frame and the next, an object or limb jumps to a new position. This happens when the model "loses track" and re-generates the element in a slightly different location.
- Physics violations: Hair falls upward, fabric hovers, water flows in the wrong direction, dropped objects pause mid-air. The model has no physics engine — it's pattern-matching motion it's seen in training data.
Natural Motion:
Frames 1-5: 🚶 🚶 🚶 🚶 🚶
(smooth, consistent stride, body proportions stable)

AI Motion Failures:
Frame 1: 🚶 (stride okay)
Frame 2: 🧍♂️ (suddenly stops)
Frame 3: 🚶 (back to walking)
Frame 4: 🏃 (speed changed)
Frame 5: 🧟 (rubber limbs)
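Several of these failures can be checked numerically once you have joint positions for each frame. The sketch below is a rough illustration, not a standard tool. It assumes you already have 2D keypoints per frame (for example from a pose estimator of your choice), stored as dictionaries mapping a joint name to an (x, y) pixel position. The function names, the joint names, and the thresholds are all hypothetical and would need tuning for real footage.

```python
import math

def limb_lengths(frames, joint_a, joint_b):
    """Distance between two joints in every frame (a real limb keeps this roughly constant)."""
    return [math.dist(f[joint_a], f[joint_b]) for f in frames]

def rubber_score(lengths):
    """Relative spread of a limb length: (max - min) / mean. Near 0 means a stable limb."""
    mean = sum(lengths) / len(lengths)
    return (max(lengths) - min(lengths)) / mean

def teleport_frames(frames, joint, max_step):
    """Indices of frames where a joint moved more than max_step pixels since the previous frame."""
    return [
        i for i in range(1, len(frames))
        if math.dist(frames[i][joint], frames[i - 1][joint]) > max_step
    ]

def net_drift(frames, joint):
    """Straight-line distance a joint travelled between the first and last frame."""
    return math.dist(frames[0][joint], frames[-1][joint])

# Hypothetical keypoints for a lower leg across four frames.
frames = [
    {"knee": (0, 0), "ankle": (0, 10)},
    {"knee": (1, 0), "ankle": (1, 10)},
    {"knee": (2, 0), "ankle": (2, 14)},    # lower leg stretches: rubber body
    {"knee": (30, 0), "ankle": (30, 10)},  # whole leg jumps: teleportation snap
]

shin = limb_lengths(frames, "knee", "ankle")
print("shin lengths:", shin)
print("rubber score:", round(rubber_score(shin), 3))
print("teleport at frames:", teleport_frames(frames, "knee", max_step=5.0))
print("net knee drift:", net_drift(frames, "knee"))
```

A high rubber score on a limb that should be rigid, or any frame index returned by `teleport_frames`, is a hint to regenerate or trim the clip. For a subject that is supposed to stand still, a large `net_drift` points to the drift failure described above.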
Why Motion Is Harder Than Appearance
The model learned motion from video training data, but video captures motion implicitly as pixel changes between frames. The model doesn't have an explicit concept of velocity, acceleration, or physical forces. It knows that "walking" involves leg movements that follow certain pixel patterns, but it doesn't understand that legs have joints with limited range of motion, or that momentum must be conserved.
This is similar to the hand problem in image generation (Track 4, Lesson 3), but extended through time. Each frame compounds the uncertainty of the previous frame. Small motion errors accumulate into large visible artifacts.
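To get a feel for how small per-frame errors add up, here is a toy simulation. It is only an analogy, not a description of how any video model works internally: it assumes each frame adds an independent, zero-mean Gaussian error to an object's position (a random walk). The function names, the error size, and the 24 frames-per-second figure are assumptions made for illustration.

```python
import random

def simulate_offsets(num_frames, per_frame_error, seed=0):
    """Accumulated position error after each frame, with one random error added per frame."""
    rng = random.Random(seed)
    offset = 0.0
    offsets = []
    for _ in range(num_frames):
        offset += rng.gauss(0.0, per_frame_error)  # this frame's small error
        offsets.append(offset)                     # error carried into the next frame
    return offsets

def mean_final_error(num_frames, per_frame_error, trials=2000):
    """Average absolute error at the last frame over many simulated clips."""
    total = 0.0
    for seed in range(trials):
        total += abs(simulate_offsets(num_frames, per_frame_error, seed)[-1])
    return total / trials

# Assuming 24 frames per second: compare a 1-second clip with a 4-second clip.
short_clip = mean_final_error(num_frames=24, per_frame_error=1.0)
long_clip = mean_final_error(num_frames=96, per_frame_error=1.0)
print(f"average final error, 24 frames: {short_clip:.2f} px")
print(f"average final error, 96 frames: {long_clip:.2f} px")
```

Under these assumptions the average final error keeps growing with the number of frames even though each individual error stays the same size, which matches the intuition that longer clips give drift more room to become visible.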
What Helps
- Simple, predictable motions: Slow head turns, gentle walks, static poses with environmental motion (wind, water).
- Image-to-video anchoring: Starting from a strong first frame constrains the motion space.
- Short clips: Clips of 2–4 seconds give the model less time to lose coherence. Longer clips amplify drift.
- Steady camera: Camera movement adds another variable. Keeping the camera static simplifies the model's task.
Key Takeaways
- Motion failures include rubber bodies, melting objects, drift, teleportation snaps, and physics violations.
- Models learn motion as pixel patterns, not physics — they have no skeletal model or force simulation.
- Small motion errors compound across frames, making longer clips less reliable.
- Simple motions, image-to-video anchoring, and short clips reduce motion artifacts.