Lesson 08 of 10 · published

The Modes: Text-to-Image, Inpainting, Video, and Everything In Between

~14 min · foundations, mental-model, l8


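Pippa's one-line summary: If you thought T2I was all there is, your toolbox is small. I2I, inpaint, outpaint, T2V, I2V, V2V, reference, and controlled generation are all different tools that solve different problems.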

Generative AI isn't one tool — it's a whole toolbox. Each mode takes different inputs and produces different outputs, like different power tools for different jobs. Let's map the full landscape.

Image Generation Modes

Text-to-Image (T2I) — The classic: type text, get an image. Your prompt is the only creative input (besides the model's priors). Best for exploration and ideation when you don't have a starting visual.

Image-to-Image (I2I) — Provide a reference image plus a text prompt. The model uses the image as structural guidance and reinterprets it through the lens of your prompt. Great for style transfer, pose references, and iterative refinement.

Inpainting — Mask a region of an existing image and regenerate just that area. Think Photoshop's content-aware fill but with AI generation power. Essential for fixing localized problems.
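
One way to picture "regenerate just that area" is as a per-pixel blend: where the mask is on, take the newly generated pixel; everywhere else, keep the original. The sketch below is only an illustration under simplifying assumptions. Images are plain Python lists of grayscale values, `generated` is a stand-in for model output, and `composite_inpaint` is a hypothetical helper name, not any tool's actual API.

```python
from typing import List

Image = List[List[int]]  # grayscale pixels, 0-255 (illustrative representation)
Mask = List[List[int]]   # 1 = regenerate this pixel, 0 = keep the original


def composite_inpaint(original: Image, generated: Image, mask: Mask) -> Image:
    """Keep original pixels where mask is 0; take generated pixels where mask is 1."""
    if not (len(original) == len(generated) == len(mask)):
        raise ValueError("original, generated, and mask must have the same height")
    result = []
    for orig_row, gen_row, mask_row in zip(original, generated, mask):
        if not (len(orig_row) == len(gen_row) == len(mask_row)):
            raise ValueError("rows must have the same width")
        result.append([g if m else o for o, g, m in zip(orig_row, gen_row, mask_row)])
    return result


original = [[10, 10, 10],
            [10, 10, 10]]
generated = [[99, 99, 99],
             [99, 99, 99]]  # stand-in for what a model would produce
mask = [[0, 1, 0],
        [0, 1, 1]]

edited = composite_inpaint(original, generated, mask)
print(edited)  # [[10, 99, 10], [10, 99, 99]]
```

Only the three masked pixels change, which is why inpainting suits localized fixes: everything outside the mask is guaranteed to survive untouched in this simplified picture.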

Outpainting — Extend the canvas beyond the original image borders. The model generates new content that seamlessly continues the scene. Useful for adjusting aspect ratios or revealing "what's outside the frame."
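
Outpainting can be framed as inpainting on a larger canvas: pad the original image, then mark the new border as the region to generate. The sketch below prepares that padded canvas and its mask for a horizontal extension. It assumes the same list-of-lists grayscale representation, and `extend_canvas` with its `pad_left`, `pad_right`, and `fill` parameters is a hypothetical helper for illustration only.

```python
from typing import List, Tuple

Image = List[List[int]]  # grayscale pixels, 0-255 (illustrative representation)


def extend_canvas(image: Image, pad_left: int, pad_right: int,
                  fill: int = 0) -> Tuple[Image, Image]:
    """Pad an image horizontally.

    Returns (canvas, mask). The mask is 1 on newly added pixels (to be
    generated) and 0 on pixels copied from the original image.
    """
    if pad_left < 0 or pad_right < 0:
        raise ValueError("padding must be non-negative")
    canvas, mask = [], []
    for row in image:
        canvas.append([fill] * pad_left + list(row) + [fill] * pad_right)
        mask.append([1] * pad_left + [0] * len(row) + [1] * pad_right)
    return canvas, mask


square = [[5, 6],
          [7, 8]]
canvas, mask = extend_canvas(square, pad_left=1, pad_right=2)
print(canvas)  # [[0, 5, 6, 0, 0], [0, 7, 8, 0, 0]]
print(mask)    # [[1, 0, 0, 1, 1], [1, 0, 0, 1, 1]]
```

Here a 2x2 image becomes a 5x2 canvas, which is the aspect-ratio adjustment described above; the model's job is then to fill the masked border so it continues the scene.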

T2I:   "prompt" ─────────────────────────▶ 🖼️ New image

I2I:   "prompt" + 🖼️ reference ──────────▶ 🖼️ Reinterpreted image

Inpaint: 🖼️ image + 🎭 mask + "prompt" ──▶ 🖼️ Partially edited image

Outpaint: 🖼️ image + ⬜ extended canvas ──▶ 🖼️ Expanded image

Video Generation Modes

Text-to-Video (T2V) — Type text, get a video clip. Models like Sora 2, Veo 3.1, and Kling 3.0 can produce 5-60 second clips. Quality has improved dramatically — as of early 2026, finger artifacts are largely solved and temporal consistency is vastly better. But coherent long-form video remains challenging.

Image-to-Video (I2V) — Provide a still image and animate it. This is often more reliable than pure text-to-video because the model has a clear visual anchor for the first frame — character appearance, scene composition, lighting, and style are already established.

Video-to-Video (V2V) — Transform an existing video: change its style, enhance resolution, modify elements while preserving motion. Useful for re-stylizing footage or applying consistent effects across frames.
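
The image and video modes covered so far differ mainly in which inputs you supply and what you want back. As a rough mental model, the sketch below maps a set of inputs to a mode name. The `Inputs` dataclass, its field names, and `pick_mode` are all hypothetical and invented for this illustration; actual tools expose these choices in their own ways.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Inputs:
    prompt: Optional[str] = None
    image: Optional[str] = None   # path to a still image (hypothetical field)
    mask: Optional[str] = None    # path to a mask for the image
    video: Optional[str] = None   # path to source footage
    expand_canvas: bool = False   # extend beyond the image borders
    want_video: bool = False      # desired output is a clip, not a still


def pick_mode(x: Inputs) -> str:
    """Return the name of the generation mode that matches the given inputs."""
    if x.video is not None:
        return "V2V"  # existing footage is transformed
    if x.image is None:
        if x.prompt is None:
            raise ValueError("need at least a prompt, an image, or a video")
        return "T2V" if x.want_video else "T2I"
    if x.want_video:
        return "I2V"  # the still image anchors the first frame
    if x.mask is not None:
        return "inpainting"
    if x.expand_canvas:
        return "outpainting"
    return "I2I"


print(pick_mode(Inputs(prompt="a fox in snow")))                          # T2I
print(pick_mode(Inputs(prompt="fox turns", image="fox.png",
                       want_video=True)))                                 # I2V
print(pick_mode(Inputs(prompt="add a scarf", image="fox.png",
                       mask="neck.png")))                                 # inpainting
```

The order of the checks is a design choice for this sketch: the most specific input (a source video, then a mask) decides the mode first, and plain text-to-image is the fallback when only a prompt is present.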

Emerging Modes

The landscape keeps expanding:

  • Reference-based generation: Providing multiple reference images for identity, style, or composition guidance
  • Subject-driven generation: Training or conditioning on specific subjects (a particular person, product, or character)
  • Controlled generation: Using depth maps, edge maps, pose skeletons, or segmentation masks to precisely control composition
  • Audio-native video: Models like Veo 3.1 and Seedance 2.0 that generate synchronized audio alongside video
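
To make the "edge map" control signal from the list above concrete, here is a deliberately crude sketch that derives one from a grayscale image by thresholding the difference between neighboring pixels. It is a simplified stand-in, not how any particular tool computes its control inputs; `edge_map` and its `threshold` parameter are hypothetical, and production pipelines generally rely on more robust edge detectors.

```python
from typing import List

Image = List[List[int]]  # grayscale pixels, 0-255 (illustrative representation)


def edge_map(gray: Image, threshold: int = 50) -> Image:
    """Mark a pixel 1 if it differs from its right or lower neighbor by more than threshold."""
    height = len(gray)
    width = len(gray[0]) if height else 0
    edges = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            # Differences to the right and lower neighbors (0 at the borders).
            dx = abs(gray[y][x + 1] - gray[y][x]) if x + 1 < width else 0
            dy = abs(gray[y + 1][x] - gray[y][x]) if y + 1 < height else 0
            if max(dx, dy) > threshold:
                edges[y][x] = 1
    return edges


# A dark region on the left, a bright region on the right.
gray = [
    [0, 0, 200, 200],
    [0, 0, 200, 200],
    [0, 0, 200, 200],
]
print(edge_map(gray))  # [[0, 1, 0, 0], [0, 1, 0, 0], [0, 1, 0, 0]]
```

The output keeps only the vertical boundary between the two regions. A map like this carries composition (where shapes begin and end) without color or texture, which is what lets a prompt restyle the content while the layout stays fixed.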

Key Takeaways
  • There are many generation modes: T2I, I2I, inpainting, outpainting, T2V, I2V, V2V, and more.
  • Each mode takes different inputs and solves different creative problems.
  • Image-to-video is often more reliable than pure text-to-video because it provides visual anchoring.
  • Modern models increasingly support multi-modal inputs: text + image + audio + control signals.

Exercise

Pick one video you watched recently. If you were to generate it today, which mode (T2V, I2V, or V2V) would you start with? What would the first shot be? Why?
