The Modes: Text-to-Image, Inpainting, Video, and Everything In Between

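Pippa's one-line summary: If you thought T2I was all there is, your toolbox is small. I2I, inpaint, outpaint, T2V, I2V, V2V, reference, and controlled generation are all different tools that solve different problems.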

Generative AI isn't one tool — it's a whole toolbox. Each mode takes different inputs and produces different outputs, like different power tools for different jobs. Let's map the full landscape.

Image Generation Modes

Text-to-Image (T2I) — The classic: type text, get an image. Your prompt is the only creative input (besides the model's priors). Best for exploration and ideation when you don't have a starting visual.

Image-to-Image (I2I) — Provide a reference image plus a text prompt. The model uses the image as structural guidance and reinterprets it through the lens of your prompt. Great for style transfer, pose references, and iterative refinement.

Inpainting — Mask a region of an existing image and regenerate just that area, leaving everything outside the mask untouched. Think of Photoshop's Content-Aware Fill, but driven by a generative model and a text prompt. Essential for fixing localized problems.
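To make the role of the mask concrete, here is a minimal sketch of the final compositing idea, assuming grayscale images stored as nested lists and a mask with values between 0 and 1. The function name `composite_inpaint` is hypothetical, and real inpainting models typically condition on the masked image inside the model rather than blending afterwards, so treat this only as an illustration of "regenerate just that area."

```python
def composite_inpaint(original, generated, mask):
    """Blend two same-sized grayscale images using a mask.

    mask value 0.0 keeps the original pixel, 1.0 takes the generated pixel,
    and values in between feather the boundary.
    """
    height, width = len(original), len(original[0])
    result = []
    for y in range(height):
        row = []
        for x in range(width):
            m = mask[y][x]
            blended = m * generated[y][x] + (1 - m) * original[y][x]
            row.append(round(blended))
        result.append(row)
    return result


# Toy 2x3 example: the right column is regenerated, the middle pixel
# of the first row is a soft edge, everything else is preserved.
original = [[10, 10, 10], [10, 10, 10]]
generated = [[200, 200, 200], [200, 200, 200]]
mask = [[0.0, 0.5, 1.0], [0.0, 0.0, 1.0]]
edited = composite_inpaint(original, generated, mask)
```

Pixels where the mask is 0 come back exactly as they were in the original, which is what separates inpainting from a full I2I pass.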

Outpainting — Extend the canvas beyond the original image borders. The model generates new content that seamlessly continues the scene. Useful for adjusting aspect ratios or revealing "what's outside the frame."
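As a rough sketch of the aspect-ratio use case, the snippet below computes how large the extended canvas needs to be and which pixels are new. The helper names `outpaint_canvas` and `outpaint_mask` are hypothetical, and centering the original image is an assumption made for this example; real tools may let you place the original anywhere on the canvas.

```python
def outpaint_canvas(width, height, target_ratio):
    """Return (new_width, new_height, left, top) for a centered original.

    target_ratio is width / height of the desired canvas, e.g. 16 / 9.
    """
    if width / height < target_ratio:
        # Image is too narrow: extend horizontally.
        new_w, new_h = round(height * target_ratio), height
    else:
        # Image is too wide (or already matches): extend vertically.
        new_w, new_h = width, round(width / target_ratio)
    left = (new_w - width) // 2
    top = (new_h - height) // 2
    return new_w, new_h, left, top


def outpaint_mask(width, height, new_w, new_h, left, top):
    """Build a mask where 1 marks new pixels to generate and 0 marks original pixels."""
    return [
        [0 if left <= x < left + width and top <= y < top + height else 1
         for x in range(new_w)]
        for y in range(new_h)
    ]


# A square 1024x1024 image extended to a 16:9 frame.
canvas = outpaint_canvas(1024, 1024, 16 / 9)

# A tiny 4x4 image extended to 2:1, small enough to inspect the mask by eye.
small = outpaint_canvas(4, 4, 2.0)
small_mask = outpaint_mask(4, 4, *small)
```

Seen this way, outpainting is inpainting with the mask turned inside out: the original pixels are kept and the surrounding border is generated.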

T2I:   "prompt" ─────────────────────────▶ 🖼️ New image

I2I:   "prompt" + 🖼️ reference ──────────▶ 🖼️ Reinterpreted image

Inpaint: 🖼️ image + 🎭 mask + "prompt" ──▶ 🖼️ Partially edited image

Outpaint: 🖼️ image + ⬜ extended canvas ──▶ 🖼️ Expanded image
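The diagram above can also be read as a decision rule: which inputs you supply determines which mode you are in. The sketch below encodes that rule under stated assumptions. The `ImageRequest` class, its field names, and `infer_image_mode` are all hypothetical and do not correspond to any particular tool's API.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class ImageRequest:
    """Hypothetical bundle of inputs for one image-generation call."""
    prompt: str
    image: Optional[str] = None                    # path to a source image
    mask: Optional[str] = None                     # path to a mask image
    canvas_size: Optional[Tuple[int, int]] = None  # extended (width, height)


def infer_image_mode(req: ImageRequest) -> str:
    """Map the inputs present in a request to one of the four image modes."""
    if req.image is None:
        if req.mask is not None or req.canvas_size is not None:
            raise ValueError("mask and canvas_size need a source image")
        return "T2I"
    if req.mask is not None:
        return "inpaint"   # checked first: a mask means a localized edit
    if req.canvas_size is not None:
        return "outpaint"
    return "I2I"


modes = [
    infer_image_mode(ImageRequest("a red fox")),
    infer_image_mode(ImageRequest("a red fox", image="sketch.png")),
    infer_image_mode(ImageRequest("a red fox", image="photo.png", mask="m.png")),
    infer_image_mode(ImageRequest("a red fox", image="photo.png",
                                  canvas_size=(1820, 1024))),
]
```

The prompt is present in every request; what changes between modes is the extra visual input that constrains the result.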

Video Generation Modes

Text-to-Video (T2V) — Type text, get a video clip. Models like Sora 2, Veo 3.1, and Kling 3.0 can produce 5-60 second clips. Quality has improved dramatically — as of early 2026, finger artifacts are largely solved and temporal consistency is vastly better. But coherent long-form video remains challenging.

Image-to-Video (I2V) — Provide a still image and animate it. This is often more reliable than pure text-to-video because the model has a clear visual anchor for the first frame — character appearance, scene composition, lighting, and style are already established.

Video-to-Video (V2V) — Transform an existing video: change its style, enhance resolution, modify elements while preserving motion. Useful for re-stylizing footage or applying consistent effects across frames.

Emerging Modes

The landscape keeps expanding:

Reference-based generation: Providing multiple reference images for identity, style, or composition guidance
Subject-driven generation: Training or conditioning on specific subjects (a particular person, product, or character)
Controlled generation: Using depth maps, edge maps, pose skeletons, or segmentation masks to precisely control composition
Audio-native video: Models like Veo 3.1 and Seedance 2.0 that generate synchronized audio alongside video
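To give a feel for what a control signal looks like, here is a small sketch that derives a binary edge map from a grayscale image. It uses a plain forward-difference threshold, which is a deliberate simplification chosen for this example; production pipelines commonly use stronger detectors such as Canny or learned models. The function name `edge_map` and the default threshold are assumptions.

```python
def edge_map(gray, threshold=50):
    """Mark pixels whose brightness differs sharply from the right or lower neighbor.

    gray is a nested list of values in 0..255; the result has the same shape,
    with 1 for edge pixels and 0 elsewhere.
    """
    height, width = len(gray), len(gray[0])
    edges = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            dx = abs(gray[y][x + 1] - gray[y][x]) if x + 1 < width else 0
            dy = abs(gray[y + 1][x] - gray[y][x]) if y + 1 < height else 0
            if max(dx, dy) >= threshold:
                edges[y][x] = 1
    return edges


# A dark left half next to a bright right half yields one vertical edge line.
gray = [
    [0, 0, 255, 255],
    [0, 0, 255, 255],
    [0, 0, 255, 255],
]
edges = edge_map(gray)
```

A map like this carries composition (where the boundaries are) without carrying color or texture, which is why it can pin down layout while the prompt decides the look.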

Key Takeaways

There are many generation modes: T2I, I2I, inpainting, outpainting, T2V, I2V, V2V, and more.
Each mode takes different inputs and solves different creative problems.
Image-to-video is often more reliable than pure text-to-video because it provides visual anchoring.
Modern models increasingly support multi-modal inputs: text + image + audio + control signals.
