Text-to-Video, Image-to-Video, Video-to-Video

Pippa's one-line summary: T2V (exploration, freedom), I2V (production, anchoring), V2V (style transfer, motion preservation). The three solve different problems.

Mental model: These three modes are like three different levels of guidance you can give a film crew:

Text-to-video: "Film a sunset over the ocean with gentle waves." The crew interprets everything — location, camera angle, timing, mood. Maximum creative freedom, minimum control.
Image-to-video: "Here's a still photo of this exact beach at this angle. Now animate it." The crew has a visual starting point — composition and mood are locked. They just add motion.
Video-to-video: "Here's a rough clip I shot on my phone. Make it look like a cinematic film." The crew has the full motion and timing — they just transform the look.

Text-to-Video

The most open-ended mode. The model receives only text and generates an entire video from scratch.

Strengths:

Maximum creative freedom — explore concepts quickly
No input media needed — describe and generate
Good for brainstorming and mood exploration

Weaknesses:

Least controllable — hard to get specific compositions, identities, or exact motions
Most prone to consistency failures — no visual anchor to hold things stable
Results can be highly variable across generations

Image-to-Video

The model receives a starting image and animates it based on a text prompt describing the desired motion.

Strengths:

Strong visual anchoring — the first frame is exactly what you specified
Better identity consistency — the face/character from the image persists
More predictable composition and style
Can use AI-generated images, photographs, or artwork as the starting frame

Weaknesses:

Motion is constrained by the starting pose, so transitions that move far from it can look unnatural
The model must "invent" motion from a static image, which can look stiff
Complex motions may break the starting image's consistency

Video-to-Video

The model receives an existing video and transforms its style, content, or quality.

Strengths:

Motion is already defined — the model just changes the look
Very high temporal consistency because the timing and motion come from real footage
Excellent for style transfer on video (live-action → animation, etc.)

Weaknesses:

Requires existing video input
Limited to motions that already exist in the source video
High transformation strength can break consistency
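
To make the last weakness concrete: many diffusion-based tools expose a strength setting between 0 and 1. The lesson does not say how any particular video tool interprets it, so the sketch below assumes one common convention from image-to-image diffusion, where strength sets the fraction of the denoising schedule that is rerun on the source. The function name `steps_to_rerun` and the 40-step schedule are hypothetical, chosen only for illustration.

```python
def steps_to_rerun(strength: float, num_inference_steps: int) -> int:
    """Return how many denoising steps are applied on top of the source frames.

    Assumed convention (not taken from the lesson): the source is noised
    in proportion to `strength`, then only that share of the schedule is run.
    """
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be between 0.0 and 1.0")
    return min(int(num_inference_steps * strength), num_inference_steps)


if __name__ == "__main__":
    # Hypothetical 40-step schedule: higher strength reruns more of it.
    for strength in (0.25, 0.5, 0.75):
        print(f"strength={strength}: {steps_to_rerun(strength, 40)} of 40 steps")
```

Under this assumption, a low strength leaves most of the source footage intact, while a strength near 1 regenerates nearly everything, which is where frame-to-frame consistency tends to break.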

  Mode            Input            Control Level    Best For
  ──────────────────────────────────────────────────────────────
  Text-to-Video   Text only        Low              Brainstorming, concepts
  Image-to-Video  Image + Text     Medium-High      Anchored production shots
  Video-to-Video  Video + Text     High             Style transfer, enhancement
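
The Input column can be read as a simple decision rule: use the richest input you have. The sketch below is one way to express that, not a real tool's API. The names `Mode` and `choose_mode` are hypothetical, and the inputs are plain file paths chosen for illustration.

```python
from enum import Enum
from typing import Optional


class Mode(Enum):
    TEXT_TO_VIDEO = "text-to-video"
    IMAGE_TO_VIDEO = "image-to-video"
    VIDEO_TO_VIDEO = "video-to-video"


def choose_mode(source_image: Optional[str] = None,
                source_video: Optional[str] = None) -> Mode:
    """Pick a generation mode from the inputs available (hypothetical helper).

    A text prompt is assumed in every case, as in the table.
    """
    if source_video is not None:
        # Motion and timing already exist, so only the look is transformed.
        return Mode.VIDEO_TO_VIDEO
    if source_image is not None:
        # Composition is anchored by the first frame, and motion is added.
        return Mode.IMAGE_TO_VIDEO
    # Nothing to anchor on, so everything is generated from the prompt.
    return Mode.TEXT_TO_VIDEO


if __name__ == "__main__":
    print(choose_mode().value)
    print(choose_mode(source_image="beach.png").value)
    print(choose_mode(source_video="phone_clip.mp4").value)
```

Preferring video over image when both are present is a design choice in this sketch. A real workflow might instead use the image as a style reference for the video.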

Key Takeaways

Text-to-video: maximum freedom, minimum control. Best for exploration.
Image-to-video: visual anchoring from a starting frame. Best production balance.
Video-to-video: motion comes from existing footage; the model changes the look. Best for transformation.
Image-to-video is typically the professional sweet spot — generate a perfect frame, then animate it.
