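Pippa's one-line summary: if you thought T2I was all there is, your toolbox is small. I2I, inpaint, outpaint, T2V, I2V, V2V, reference, and controlled generation are all different tools that solve different problems.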
Generative AI isn't one tool — it's a whole toolbox. Each mode takes different inputs and produces different outputs, like different power tools for different jobs. Let's map the full landscape.
Image Generation Modes
Text-to-Image (T2I) — The classic: type text, get an image. Your prompt is the only creative input (besides the model's priors). Best for exploration and ideation when you don't have a starting visual.
Image-to-Image (I2I) — Provide a reference image plus a text prompt. The model uses the image as structural guidance and reinterprets it through the lens of your prompt. Great for style transfer, pose references, and iterative refinement.
Inpainting — Mask a region of an existing image and regenerate just that area. Think Photoshop's content-aware fill but with AI generation power. Essential for fixing localized problems.
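The defining property of inpainting, that only the masked region changes, can be sketched in a few lines. This is a toy illustration rather than any real model's API: `inpaint`, `fake_model`, and the integer "pixels" are hypothetical stand-ins, and a real model would condition the new content on the surrounding image and the prompt.

```python
from typing import Callable, List

Pixel = int
Image = List[List[Pixel]]
Mask = List[List[bool]]


def inpaint(image: Image, mask: Mask, generate: Callable[[int, int], Pixel]) -> Image:
    """Return a copy of `image` in which only masked pixels are regenerated."""
    if len(image) != len(mask) or any(len(r) != len(m) for r, m in zip(image, mask)):
        raise ValueError("mask must have the same shape as the image")
    return [
        [generate(y, x) if mask[y][x] else image[y][x] for x in range(len(row))]
        for y, row in enumerate(image)
    ]


def fake_model(y: int, x: int) -> Pixel:
    """Hypothetical stand-in for a generative model: always paints the value 0."""
    return 0


original = [[1, 2, 3],
            [4, 5, 6],
            [7, 8, 9]]
mask = [[False, False, False],
        [False, True,  True],
        [False, False, False]]

edited = inpaint(original, mask, fake_model)
print(edited)  # [[1, 2, 3], [4, 0, 0], [7, 8, 9]]
```

Everything outside the mask is copied through untouched, which is why inpainting suits localized fixes.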
Outpainting — Extend the canvas beyond the original image borders. The model generates new content that seamlessly continues the scene. Useful for adjusting aspect ratios or revealing "what's outside the frame."
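One common way to think about outpainting is as inpainting on a larger canvas: pad the image with empty pixels and mark the padding as the region to generate. The sketch below shows only that preparation step, under that assumption; `extend_canvas` and the `None` placeholder for empty pixels are hypothetical, not part of any specific tool.

```python
from typing import List, Optional, Tuple

Canvas = List[List[Optional[int]]]
Mask = List[List[bool]]


def extend_canvas(image: List[List[int]], left: int = 0, right: int = 0,
                  top: int = 0, bottom: int = 0) -> Tuple[Canvas, Mask]:
    """Pad `image` with empty (None) pixels; the mask is True where content must be generated."""
    if min(left, right, top, bottom) < 0:
        raise ValueError("padding must be non-negative")
    width = len(image[0]) if image else 0
    new_width = width + left + right

    canvas: Canvas = []
    mask: Mask = []
    for _ in range(top):
        canvas.append([None] * new_width)
        mask.append([True] * new_width)
    for row in image:
        canvas.append([None] * left + list(row) + [None] * right)
        mask.append([True] * left + [False] * width + [True] * right)
    for _ in range(bottom):
        canvas.append([None] * new_width)
        mask.append([True] * new_width)
    return canvas, mask


# Widen a square 2x2 image to a 2:1 aspect ratio by adding one column per side.
canvas, mask = extend_canvas([[1, 2], [3, 4]], left=1, right=1)
print(canvas)  # [[None, 1, 2, None], [None, 3, 4, None]]
print(mask)    # [[True, False, False, True], [True, False, False, True]]
```

The original pixels keep their values and only the new border is marked for generation, which matches the aspect-ratio use case described above.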
T2I:      "prompt" ─────────────────────────▶ 🖼️ New image
I2I:      "prompt" + 🖼️ reference ──────────▶ 🖼️ Reinterpreted image
Inpaint:  🖼️ image + 🎭 mask + "prompt" ──▶ 🖼️ Partially edited image
Outpaint: 🖼️ image + ⬜ extended canvas ──▶ 🖼️ Expanded image
Video Generation Modes
Text-to-Video (T2V) — Type text, get a video clip. Models like Sora 2, Veo 3.1, and Kling 3.0 can produce 5-60 second clips. Quality has improved dramatically — as of early 2026, finger artifacts are largely solved and temporal consistency is vastly better. But coherent long-form video remains challenging.
Image-to-Video (I2V) — Provide a still image and animate it. This is often more reliable than pure text-to-video because the model has a clear visual anchor for the first frame — character appearance, scene composition, lighting, and style are already established.
Video-to-Video (V2V) — Transform an existing video: change its style, enhance resolution, modify elements while preserving motion. Useful for re-stylizing footage or applying consistent effects across frames.
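The image and video modes above differ mainly in their input signatures, so one way to choose a mode is to ask which inputs you already have. The table below is a simplified sketch drawn from the descriptions in this section; the `Mode` class, the `usable_modes` helper, and the input names are hypothetical, and optional inputs (such as a text prompt alongside I2V or V2V) are left out.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Set


@dataclass(frozen=True)
class Mode:
    name: str
    inputs: FrozenSet[str]  # minimum inputs the mode needs
    output: str             # "image" or "video"


MODES = [
    Mode("T2I", frozenset({"prompt"}), "image"),
    Mode("I2I", frozenset({"prompt", "image"}), "image"),
    Mode("Inpaint", frozenset({"prompt", "image", "mask"}), "image"),
    Mode("Outpaint", frozenset({"image", "canvas"}), "image"),
    Mode("T2V", frozenset({"prompt"}), "video"),
    Mode("I2V", frozenset({"image"}), "video"),
    Mode("V2V", frozenset({"video"}), "video"),
]


def usable_modes(available: Set[str], want: str) -> List[str]:
    """Names of modes whose required inputs are all available and whose output matches."""
    return [m.name for m in MODES if m.inputs <= available and m.output == want]


print(usable_modes({"prompt"}, "image"))           # ['T2I']
print(usable_modes({"prompt", "image"}, "video"))  # ['T2V', 'I2V']
```

With only a prompt you are limited to the text-driven modes; adding a still image opens up I2I and I2V as well.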
Emerging Modes
The landscape keeps expanding:
- Reference-based generation: Providing multiple reference images for identity, style, or composition guidance
- Subject-driven generation: Training or conditioning on specific subjects (a particular person, product, or character)
- Controlled generation: Using depth maps, edge maps, pose skeletons, or segmentation masks to precisely control composition
- Audio-native video: Models like Veo 3.1 and Seedance 2.0 that generate synchronized audio alongside video
Key Takeaways
- There are many generation modes: T2I, I2I, inpainting, outpainting, T2V, I2V, V2V, and more.
- Each mode takes different inputs and solves different creative problems.
- Image-to-video is often more reliable than pure text-to-video because it provides visual anchoring.
- Modern models increasingly support multi-modal inputs: text + image + audio + control signals.
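To make the "edge maps" control signal from the controlled-generation item concrete, here is a minimal sketch of how one can be derived from a grayscale image. It uses a simple forward-difference threshold as a stand-in; real pipelines typically rely on a dedicated detector such as Canny, and the `edge_map` function and its `threshold` parameter are assumptions for illustration only.

```python
from typing import List


def edge_map(gray: List[List[int]], threshold: int) -> List[List[int]]:
    """Mark a pixel as an edge (1) when brightness jumps to its right or lower neighbor."""
    height = len(gray)
    width = len(gray[0]) if gray else 0
    edges = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            here = gray[y][x]
            # At the border, compare the pixel with itself (difference of zero).
            right = gray[y][x + 1] if x + 1 < width else here
            below = gray[y + 1][x] if y + 1 < height else here
            if abs(right - here) + abs(below - here) >= threshold:
                edges[y][x] = 1
    return edges


# A dark region on the left and a bright region on the right.
gray = [[0, 0, 200, 200],
        [0, 0, 200, 200],
        [0, 0, 200, 200]]

edges = edge_map(gray, threshold=100)
print(edges)  # [[0, 1, 0, 0], [0, 1, 0, 0], [0, 1, 0, 0]]
```

The resulting map keeps only the boundary between the two regions. That outline is the kind of structural constraint a controlled-generation model can follow while the prompt determines appearance.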