Pippa's one-line summary: T2V (exploration, freedom), I2V (production, anchoring), V2V (style transfer, motion preservation). The three solve different problems.
Mental model: These three modes are like three different levels of guidance you can give a film crew:
- Text-to-video: "Film a sunset over the ocean with gentle waves." The crew interprets everything — location, camera angle, timing, mood. Maximum creative freedom, minimum control.
- Image-to-video: "Here's a still photo of this exact beach at this angle. Now animate it." The crew has a visual starting point — composition and mood are locked. They just add motion.
- Video-to-video: "Here's a rough clip I shot on my phone. Make it look like a cinematic film." The crew has the full motion and timing — they just transform the look.
Text-to-Video
The most open-ended mode. The model receives only text and generates an entire video from scratch.
Strengths:
- Maximum creative freedom — explore concepts quickly
- No input media needed — describe and generate
- Good for brainstorming and mood exploration
Weaknesses:
- Least controllable — hard to get specific compositions, identities, or exact motions
- Most prone to consistency failures — no visual anchor to hold things stable
- Results can be highly variable across generations
Image-to-Video
The model receives a starting image and animates it based on a text prompt describing the desired motion.
Strengths:
- Strong visual anchoring — the first frame is exactly what you specified
- Better identity consistency — the face/character from the image persists
- More predictable composition and style
- Can use AI-generated images, photographs, or artwork as the starting frame
Weaknesses:
- Motion is constrained by the starting pose — some transitions are unnatural
- The model must "invent" motion from a static image, which can look stiff
- Complex motions may break the starting image's consistency
Video-to-Video
The model receives an existing video and transforms its style, content, or quality.
Strengths:
- Motion is already defined — the model just changes the look
- Very high temporal consistency, because the timing and motion come from the source footage
- Excellent for style transfer on video (live-action → animation, etc.)
Weaknesses:
- Requires existing video input
- Limited to motions that already exist in the source video
- A high transformation strength lets the output drift further from the source video, which can break consistency
Mode            Input         Control Level  Best For
──────────────────────────────────────────────────────────────
Text-to-Video   Text only     Low            Brainstorming, concepts
Image-to-Video  Image + Text  Medium-High    Production shots, anchored
Video-to-Video  Video + Text  High           Style transfer, enhancement
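The comparison above can be read as a simple decision rule: pick the mode from the inputs you already have. The following is a minimal Python sketch of that rule, not any tool's actual API. The names `Mode` and `choose_mode` are hypothetical, and the priority order (prefer video over image when both are available) is an assumption made for illustration.

```python
from enum import Enum


class Mode(Enum):
    TEXT_TO_VIDEO = "text-to-video"
    IMAGE_TO_VIDEO = "image-to-video"
    VIDEO_TO_VIDEO = "video-to-video"


def choose_mode(has_image: bool = False, has_video: bool = False) -> Mode:
    """Pick a generation mode from the available inputs (hypothetical helper).

    Assumption: the most constraining input wins, so an existing video
    takes priority over a still image, and text alone is the fallback.
    """
    if has_video:
        # Motion and timing already exist; only the look is transformed.
        return Mode.VIDEO_TO_VIDEO
    if has_image:
        # Composition and mood are locked by the starting frame.
        return Mode.IMAGE_TO_VIDEO
    # Nothing to anchor on: maximum freedom, minimum control.
    return Mode.TEXT_TO_VIDEO
```

For example, `choose_mode(has_image=True)` selects image-to-video, while calling `choose_mode()` with no inputs falls back to text-to-video. In practice the choice also depends on the goal (exploration versus a production shot), which this sketch does not model.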
Key Takeaways
- Text-to-video: maximum freedom, minimum control. Best for exploration.
- Image-to-video: visual anchoring from a starting frame. Best production balance.
- Video-to-video: motion from existing footage, only style changes. Best for transformation.
- Image-to-video is typically the professional sweet spot — generate a perfect frame, then animate it.
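The "generate a frame, then animate it" workflow can be sketched as a two-step pipeline. This is an illustrative outline only: `frame_then_animate`, `score`, and `animate` are hypothetical names, the scoring and animation steps are passed in as callables rather than tied to any real model or service, and picking the best of several candidate frames is an assumed way of arriving at the "perfect" frame.

```python
from typing import Callable, Sequence, TypeVar

Frame = TypeVar("Frame")
Video = TypeVar("Video")


def frame_then_animate(
    candidates: Sequence[Frame],
    score: Callable[[Frame], float],
    animate: Callable[[Frame, str], Video],
    motion_prompt: str,
) -> Video:
    """Pick the best still frame, then hand it to an image-to-video step.

    All callables are placeholders supplied by the caller (assumption):
    `score` ranks candidate frames, `animate` stands in for whatever
    image-to-video tool is used.
    """
    if not candidates:
        raise ValueError("at least one candidate frame is required")
    # Step 1: lock composition and identity by choosing one starting frame.
    best_frame = max(candidates, key=score)
    # Step 2: describe only the motion; the look comes from the frame.
    return animate(best_frame, motion_prompt)
```

The design point is the separation of concerns: composition is settled on a still image, where iteration is cheap, and the motion prompt only has to describe movement.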