One-line summary: Camera motion (geometric) is relatively easy. Subject motion (articulation) is hard. Both at once = complexity multiplied. Stick to the 'one mover at a time' rule.
Mental model: In real filmmaking, the camera operator and the actor are independent agents. The camera can dolly forward while the actor stands still. The actor can walk while the camera stays locked on a tripod. Or both can move simultaneously in a choreographed dance. In AI video generation, these two types of motion are completely different challenges, and conflating them is one of the most common mistakes beginners make.
Camera Motion: Moving the Viewpoint
Camera movements change what the viewer sees by shifting the perspective:
- Pan: Camera rotates horizontally (looking left or right)
- Tilt: Camera rotates vertically (looking up or down)
- Dolly/Push-in: Camera physically moves toward or away from the subject
- Truck: Camera moves laterally (sliding left or right)
- Crane: Camera moves vertically (rising or descending)
- Orbit: Camera circles around the subject
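Camera motion is relatively easy for video models because the entire scene undergoes one geometrically consistent transformation. When a camera pans left, every pixel shifts according to the same rule, fixed by a single rotation angle. When it dollies or trucks, nearer objects shift more than distant ones (parallax), but the pattern is still fully determined by the camera's movement and the scene's depth. The model can learn these global transformations from training data.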
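To make "geometrically consistent" concrete, here is a minimal sketch assuming an ideal pinhole camera with no lens distortion. The function names (`pan_x`, `zoom_xy`, `truck_shift`) are illustrative only and are not part of any video model's API. It shows that one or two numbers decide where every pixel goes: a pan needs only the rotation angle, a zoom only the scale factor, and a sideways truck needs the offset plus each point's depth, which is where parallax comes from.

```python
import math

def pan_x(x: float, focal: float, theta: float) -> float:
    """Horizontal image coordinate of a pixel after the camera pans by theta radians.

    Assumes an ideal pinhole camera rotating about its vertical axis. Scene depth
    never appears: a pure rotation moves every pixel by the same rule.
    Positive theta turns the camera toward +x, so image content slides toward -x.
    """
    angle = math.atan2(x, focal)  # angle of the viewing ray through this pixel
    return focal * math.tan(angle - theta)

def zoom_xy(x: float, y: float, scale: float) -> tuple[float, float]:
    """A zoom (focal-length change) scales every coordinate about the image centre."""
    return scale * x, scale * y

def truck_shift(focal: float, offset: float, depth: float) -> float:
    """Horizontal image shift of a point at `depth` when the camera slides sideways by `offset`.

    Unlike a pan, the shift depends on depth: near points move more (parallax).
    """
    return -focal * offset / depth

if __name__ == "__main__":
    for x in (-400.0, 0.0, 400.0):
        print(f"pan 3 deg: {x:+.0f} -> {pan_x(x, 1000.0, math.radians(3)):+.1f}")
    print("truck 0.5, point at depth 2: ", truck_shift(1000.0, 0.5, 2.0))   # -250.0
    print("truck 0.5, point at depth 10:", truck_shift(1000.0, 0.5, 10.0))  # -50.0
```

The pan loop moves off-centre pixels by slightly different amounts, but all of them follow from the single angle; the truck lines show the depth-dependent shift that dolly and truck moves introduce.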
Subject Motion: Objects and People Moving
Subject motion is where things get hard. A person walking involves coordinated limb movement, weight shifting, fabric dynamics, hair bouncing, shadow changes — dozens of interrelated physical processes happening simultaneously. Each of these must be individually correct AND consistent with each other AND consistent across frames.
Camera Motion (easier):       Subject Motion (harder):
┌──────────────────────────┐  ┌──────────────────────────┐
│ Whole scene shifts       │  │ One element moves,       │
│ uniformly: geometric     │  │ everything else must     │
│ transformation           │  │ react correctly          │
│                          │  │                          │
│ Pan:  all pixels shift → │  │ Walk: legs, arms, head,  │
│ Zoom: all pixels scale   │  │ fabric, hair, shadow,    │
│ Tilt: all pixels shift ↑ │  │ ground contact, weight   │
└──────────────────────────┘  └──────────────────────────┘
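One item in the right-hand box, ground contact, shows why subject motion must stay consistent across frames. A familiar failure is "foot skating": a foot that should be planted slides along the ground. The sketch below is a hypothetical check, not taken from any particular tool; it assumes you already have per-frame foot positions (for example from a pose tracker) and a flag saying whether the foot is planted in each frame.

```python
from typing import Sequence, Tuple

Point = Tuple[float, float]

def foot_skate(positions: Sequence[Point], planted: Sequence[bool]) -> float:
    """Total distance a foot moves between consecutive frames in which it is planted.

    A physically plausible walk keeps this near zero; a clip where the planted
    foot drifts scores higher.
    """
    if len(positions) != len(planted):
        raise ValueError("need one planted flag per frame")
    total = 0.0
    for i in range(1, len(positions)):
        if planted[i - 1] and planted[i]:
            (x0, y0), (x1, y1) = positions[i - 1], positions[i]
            total += ((x1 - x0) ** 2 + (y1 - y0) ** 2) ** 0.5
    return total

# Hypothetical tracks: the foot is planted for the first three frames, then lifts.
steady = [(0.0, 0.0), (0.0, 0.0), (0.0, 0.0), (3.0, 4.0)]
sliding = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0), (9.0, 12.0)]
flags = [True, True, True, False]
print(foot_skate(steady, flags))   # 0.0
print(foot_skate(sliding, flags))  # 10.0
```

This is only one constraint among the many in the box; a convincing walk has to satisfy all of them at once, in every frame.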
The Danger of Combining Both
The hardest prompts ask for simultaneous camera AND subject motion: "camera dollies forward as a woman walks toward the camera, wind blowing her hair." This requires the model to:
- Generate natural walking motion (legs, arms, body sway)
- Simulate camera approaching (perspective shift, parallax)
- Handle the interaction (subject gets larger in frame as camera approaches)
- Render secondary motion (hair in the wind, flowing fabric)
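The third requirement, the subject growing in frame, is easy to underestimate. Under a simple pinhole assumption, apparent height is focal length times real height divided by distance, and when the camera dollies in while the subject walks toward it, the gap closes at the sum of both speeds. A minimal sketch with made-up numbers (a 1.7 m subject starting 10 m away, a 0.5 m/s dolly, a 1.2 m/s walk); none of these values come from the text, and the function names are illustrative:

```python
def apparent_height_px(focal_px: float, height_m: float, distance_m: float) -> float:
    """Pinhole projection: on-screen height in pixels of an object at a given distance."""
    return focal_px * height_m / distance_m

def gap_after(t_s: float, start_m: float, camera_speed: float, subject_speed: float) -> float:
    """Camera-to-subject distance after t seconds when both move toward each other."""
    return start_m - (camera_speed + subject_speed) * t_s

FOCAL, HEIGHT, START = 1000.0, 1.7, 10.0
for label, cam, subj in [("camera only", 0.5, 0.0),
                         ("subject only", 0.0, 1.2),
                         ("both", 0.5, 1.2)]:
    d = gap_after(2.0, START, cam, subj)
    print(f"{label:12s}: {apparent_height_px(FOCAL, HEIGHT, d):6.1f} px after 2 s")
```

The "both" case grows fastest, so the model has to enlarge the walker and add detail faster than in either single-motion shot, while keeping the gait itself plausible.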
Each additional motion layer compounds complexity. The most reliable strategy is to separate them: move either the camera OR the subject, not both, unless you are using a top-tier model with a proven track record on combined camera and subject motion.
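One way to see why layers compound rather than add: treat each motion layer as something the model renders acceptably with some probability and assume the layers fail independently (a simplification). A clip is usable only if every layer succeeds, so the success rates multiply. The rates below are invented for illustration, not measurements of any model.

```python
from math import prod

def clip_success_rate(layer_rates: dict[str, float]) -> float:
    """Chance every motion layer is rendered acceptably, assuming independent failures."""
    return prod(layer_rates.values())

# Hypothetical per-layer success rates, for illustration only.
camera_only = {"dolly": 0.9}
subject_only = {"walk": 0.7}
combined = {"dolly": 0.9, "walk": 0.7, "hair and fabric": 0.8}

for name, layers in [("camera only", camera_only),
                     ("subject only", subject_only),
                     ("combined", combined)]:
    print(f"{name:12s}: {clip_success_rate(layers):.2f}")
```

With these made-up numbers the combined prompt drops to about 0.50, even though no single layer is below 0.7.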
Risky: "Camera orbits around a dancer performing a complex spin while confetti falls and spotlights sweep across the stage"
Safer: "Static camera, medium shot: a dancer performs a slow, graceful turn in place, soft spotlight, minimal stage decoration"
- Camera motion (pan, dolly, orbit) is geometrically predictable and relatively easier for models.
- Subject motion (walking, gesturing, dancing) involves complex physics and is much harder.
- Combining camera + subject motion multiplies complexity and failure rate.
- "One mover at a time" is the safest rule for reliable video generation.
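The "one mover at a time" rule can be roughed out as a prompt linter. This is a hypothetical keyword heuristic, not a feature of any video model; the word lists are assumptions and would miss paraphrases, but it flags the orbiting-dancer prompt and passes the static-camera one.

```python
import re

# Hypothetical keyword lists, for illustration only. A real checker would need a
# far richer vocabulary and would still miss paraphrases.
MOTION_TERMS = {
    "camera": {"pan", "pans", "tilt", "tilts", "dolly", "dollies", "truck",
               "trucks", "crane", "cranes", "orbit", "orbits", "push-in"},
    "subject": {"walk", "walks", "walking", "dance", "dances", "dancing",
                "spin", "spins", "turn", "turns", "performs", "performing"},
    "secondary": {"falls", "falling", "sweep", "sweeps", "blowing", "flowing"},
}

def movers(prompt: str) -> set[str]:
    """Return the motion categories a prompt appears to request."""
    words = set(re.findall(r"[a-z]+(?:-[a-z]+)*", prompt.lower()))
    return {kind for kind, terms in MOTION_TERMS.items() if words & terms}

def one_mover_ok(prompt: str) -> bool:
    """True when the prompt asks for at most one kind of motion."""
    return len(movers(prompt)) <= 1

risky = ("Camera orbits around a dancer performing a complex spin while "
         "confetti falls and spotlights sweep across the stage")
safer = ("Static camera, medium shot: a dancer performs a slow, graceful turn "
         "in place, soft spotlight, minimal stage decoration")
print(sorted(movers(risky)), one_mover_ok(risky))  # ['camera', 'secondary', 'subject'] False
print(sorted(movers(safer)), one_mover_ok(safer))  # ['subject'] True
```

Note that "Static camera" correctly contributes no camera motion, since neither word is a movement term; a keyword list like this is a quick sanity check, not a substitute for reading the prompt.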