Lesson 02 of 10 · published

Language Generation vs. Media Generation

~15 min · foundations, mental-model, l2

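Pippa's one-line summary: Text models like ChatGPT produce tokens one at a time, left to right. Image models don't work that way: they refine the entire latent canvas at once. The key point is that the two are fundamentally different problems.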

You probably already use ChatGPT or similar tools for text. So here's a natural question: if AI can write essays, why is generating images a fundamentally different problem? The answer lies in what gets generated and how.

Think of it like this: text generation is like building with LEGO bricks — you place one brick (word/token) at a time, left to right. Image generation is like sculpting from clay — you start with a rough shape and refine the entire thing simultaneously. These are radically different processes.

Tokens vs. Pixels vs. Latent Representations

Language Model:     [The] → [cat] → [sat] → [on] → [the] → [mat]
                     One token at a time, left to right →

Image Model:        🌫️ Full noisy canvas → 🌁 Refine everywhere at once → 🖼️ Final image
                     All pixels evolve simultaneously

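Language models generate tokens: discrete chunks of text (roughly words or word pieces). A typical vocabulary holds somewhere around 50,000–100,000 tokens, though the exact size varies by model. At each step the model assigns a probability to every token in the vocabulary and then picks the next one, either the most probable token or one sampled from the high-probability candidates. It repeats this one token at a time. This is called autoregressive generation.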
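The one-token-at-a-time loop can be sketched in a few lines of Python. This is a toy illustration, not how any real model is implemented: the `NEXT_TOKEN_PROBS` table and the `generate` function are hypothetical, the probabilities are invented, and the table looks only at the previous token, whereas a real language model conditions on everything generated so far.

```python
import random

# Hypothetical toy "model": for each token, the probabilities of the next token.
# A real language model computes these with a neural network over a vocabulary
# of many thousands of tokens; this table is invented for illustration.
NEXT_TOKEN_PROBS = {
    "<start>": {"The": 1.0},
    "The": {"cat": 0.7, "dog": 0.3},
    "cat": {"sat": 0.8, "slept": 0.2},
    "dog": {"sat": 0.5, "slept": 0.5},
    "sat": {"on": 1.0},
    "slept": {"on": 1.0},
    "on": {"the": 1.0},
    "the": {"mat": 0.6, "sofa": 0.4},
    "mat": {"<end>": 1.0},
    "sofa": {"<end>": 1.0},
}


def generate(max_tokens=10, seed=0):
    """Generate tokens one at a time, each chosen from the previous token's table."""
    rng = random.Random(seed)
    tokens = []
    current = "<start>"
    for _ in range(max_tokens):
        candidates = NEXT_TOKEN_PROBS[current]
        # Sample the next token in proportion to its probability.
        current = rng.choices(list(candidates), weights=list(candidates.values()))[0]
        if current == "<end>":
            break
        tokens.append(current)
    return tokens


sentence = generate()
print(" ".join(sentence))
```

Changing `seed` can change which branch is taken at each step, which mirrors why the same prompt can yield different text on different runs.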

Image models typically don't generate pixel-by-pixel from left to right. A 1024×1024 image has over 3 million values (RGB per pixel). Instead, most modern image models work in latent space — a compressed mathematical representation — and generate the entire image at once through a gradual refinement process called diffusion.
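To make the size difference concrete, here is the arithmetic as a short Python sketch. The pixel count follows directly from the figures above. The latent layout is an assumption chosen for illustration (each side shrunk by a factor of 8, with 4 channels); actual models differ in both numbers.

```python
# Raw pixel space: one value per colour channel per pixel.
width, height, rgb_channels = 1024, 1024, 3
pixel_values = width * height * rgb_channels

# Assumed latent layout (not from the lesson): each side shrunk 8x, 4 channels.
downsample, latent_channels = 8, 4
latent_values = (width // downsample) * (height // downsample) * latent_channels

print(f"pixel values:  {pixel_values:,}")                   # 3,145,728
print(f"latent values: {latent_values:,}")                  # 65,536
print(f"compression:   {pixel_values // latent_values}x")   # 48x
```

Under these assumed numbers the model works on roughly 48 times fewer values than the raw image contains, which is the practical reason for generating in latent space and decoding to pixels at the end.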

Why the Difference Matters

Aspect            | Language AI                      | Image AI
Unit of output    | Token (word piece)               | Latent vector → decoded to pixels
Generation style  | Sequential (left to right)       | Simultaneous (refine entire image)
Output space      | ~100K possible tokens per step   | Continuous pixel values (infinite possibilities)
Evaluation        | "Does this sentence make sense?" | "Does this image look real?"
Error visibility  | Grammatical errors are obvious   | Visual errors can hide in peripheral areas

This difference explains why prompting techniques from text AI don't directly translate. When you prompt a language model, you're setting up the start of a sequential chain. When you prompt an image model, you're setting a destination in latent space — a target the model tries to reach through a refinement process.
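The "destination" idea can be sketched as a toy refinement loop in Python. This is deliberately not real diffusion: in an actual image model a neural network predicts what noise to remove at each step and never sees the final answer, while here the hypothetical `refine` function is simply handed the `target` values. The sketch only shows the shape of the process, where a noisy canvas is updated everywhere at once and drifts toward a target over several steps.

```python
import random


def refine(target, steps=20, seed=0):
    """Start from noise and nudge every value toward the target at each step."""
    rng = random.Random(seed)
    canvas = [rng.gauss(0.0, 1.0) for _ in target]  # full noisy canvas
    start_error = max(abs(c - t) for c, t in zip(canvas, target))
    for _ in range(steps):
        # Every position is updated in the same step; none waits for another.
        canvas = [c + 0.3 * (t - c) for c, t in zip(canvas, target)]
    return canvas, start_error


# Hypothetical "destination": in a real model the prompt is encoded into a
# conditioning signal; here it is just a fixed list of numbers.
target = [0.2, 0.9, 0.5, 0.1, 0.7, 0.3, 0.8, 0.4]
canvas, start_error = refine(target)
final_error = max(abs(c - t) for c, t in zip(canvas, target))
print(f"largest error before: {start_error:.3f}, after: {final_error:.6f}")
```

Contrast this with a sequential loop: there is no "next position" to fill in, only a whole canvas that gets a little closer to the target on every pass.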

Key Takeaways
  • Language models generate tokens one at a time (autoregressive). Image models refine an entire image simultaneously (diffusion).
  • Images live in a continuous space (effectively unlimited possible values), not a discrete vocabulary.
  • Image models work in compressed "latent space," not raw pixels.
  • Prompting an image model sets a destination, not a starting sequence.

Exercise

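Give the same prompt to a text model (Claude/ChatGPT) and an image model. Compare: what can the text model do that the image model can't, and vice versa? Write up the differences you actually observed in 200 characters.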
