Language Generation vs. Media Generation

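Pippa's one-line summary: Text models like ChatGPT produce tokens one at a time, left to right. Image models don't work that way: they refine the entire latent canvas at once. The key point is that these are fundamentally different problems.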

You probably already use ChatGPT or similar tools for text. So here's a natural question: if AI can write essays, why is generating images a fundamentally different problem? The answer lies in what gets generated and how.

Think of it like this: text generation is like building with LEGO bricks — you place one brick (word/token) at a time, left to right. Image generation is like sculpting from clay — you start with a rough shape and refine the entire thing simultaneously. These are radically different processes.

Tokens vs. Pixels vs. Latent Representations

Language Model:     [The] → [cat] → [sat] → [on] → [the] → [mat]
                     One token at a time, left to right →

Image Model:        🌫️ Full noisy canvas → 🌁 Refine everywhere at once → 🖼️ Final image
                     All pixels evolve simultaneously

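Language models generate tokens: discrete chunks of text (roughly words or word pieces). Vocabularies typically contain on the order of 50,000–100,000 tokens, though the size varies by model. At each step the model assigns a probability to every token in the vocabulary and selects one, either the most probable token or a sample from that distribution, then repeats with the new token appended. This is called autoregressive generation.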
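The loop described above can be sketched with a toy example. Everything here is an illustrative assumption: `TOY_MODEL` is a hypothetical hand-written lookup table rather than a trained network, and it conditions only on the previous token, whereas a real language model conditions on the whole sequence so far.

```python
import random

# Hypothetical toy "model": maps the previous token to next-token probabilities.
# A real language model computes these probabilities from the entire prefix.
TOY_MODEL = {
    "<start>": {"The": 1.0},
    "The": {"cat": 0.7, "dog": 0.3},
    "cat": {"sat": 0.8, "ran": 0.2},
    "dog": {"sat": 0.5, "ran": 0.5},
    "sat": {"<end>": 1.0},
    "ran": {"<end>": 1.0},
}


def generate(greedy=True, seed=0, max_tokens=10):
    """Build a sequence one token at a time, left to right."""
    rng = random.Random(seed)
    tokens = []
    prev = "<start>"
    for _ in range(max_tokens):
        probs = TOY_MODEL[prev]
        if greedy:
            # Pick the single most probable next token.
            nxt = max(probs, key=probs.get)
        else:
            # Sample the next token in proportion to its probability.
            nxt = rng.choices(list(probs), weights=list(probs.values()))[0]
        if nxt == "<end>":
            break
        tokens.append(nxt)
        prev = nxt
    return tokens


print(generate(greedy=True))  # ['The', 'cat', 'sat']
```

Each pass through the loop commits to one token before the next is considered, which is the sequential property the rest of this section contrasts with image generation.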

Image models typically don't generate pixel by pixel from left to right. A 1024×1024 RGB image has over 3 million values (1024 × 1024 × 3 = 3,145,728), which makes one-value-at-a-time generation impractical. Instead, most modern image models work in latent space, a compressed mathematical representation, and generate the entire image at once through a gradual refinement process called diffusion.
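To make the size argument concrete, here is the arithmetic for the 1024×1024 image, alongside a rough latent size. The latent layout (8× spatial downsampling with 4 channels) is an assumption chosen for illustration; the text does not specify one, and actual models differ.

```python
# Raw value count for the 1024x1024 RGB image mentioned above.
height, width, channels = 1024, 1024, 3
pixel_values = height * width * channels

# Assumed latent layout (illustrative only, not taken from the text):
# 8x downsampling along each spatial axis, 4 latent channels.
downsample, latent_channels = 8, 4
latent_values = (height // downsample) * (width // downsample) * latent_channels

print(pixel_values)                   # 3145728
print(latent_values)                  # 65536
print(pixel_values // latent_values)  # 48
```

Under these assumed numbers the model refines roughly 48 times fewer values than it would in raw pixel space, which is the practical motivation for working in a compressed representation.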

Why the Difference Matters

Aspect	Language AI	Image AI
Unit of output	Token (word piece)	Latent vector → decoded to pixels
Generation style	Sequential (left to right)	Simultaneous (refine entire image)
Output space	~50K–100K possible tokens per step	Continuous latent values (effectively infinite possibilities)
Evaluation	"Does this sentence make sense?"	"Does this image look real?"
Error visibility	Grammatical errors are obvious	Visual errors can hide in peripheral areas

This difference explains why prompting techniques from text AI don't directly translate. When you prompt a language model, you're setting up the start of a sequential chain. When you prompt an image model, you're setting a destination in latent space — a target the model tries to reach through a refinement process.
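The "destination" idea can be illustrated with a deliberately simplified sketch. This is not a real diffusion model: the hypothetical `target` list stands in for wherever the prompt points in latent space, and the update rule is plain interpolation, whereas an actual model uses a trained network to estimate and remove noise at each step. It shows only the shape of the process: start from noise and update every position in every step.

```python
import random


def refine(target, steps=20, rate=0.3, seed=0):
    """Toy refinement: move a noisy canvas toward a target, all positions at once."""
    rng = random.Random(seed)
    # Start from pure noise, one value per position in the canvas.
    canvas = [rng.gauss(0.0, 1.0) for _ in target]
    for _ in range(steps):
        # Every position is nudged toward its destination in the same step;
        # nothing is generated "left to right".
        canvas = [c + rate * (t - c) for c, t in zip(canvas, target)]
    return canvas


# Hypothetical destination standing in for a prompt-conditioned target.
target = [0.2, 0.8, 0.5, 0.1]
result = refine(target)
error = max(abs(r - t) for r, t in zip(result, target))
print(f"largest remaining error: {error:.6f}")
```

Compare this with the autoregressive case, where the prompt is simply the beginning of the sequence being extended: here the prompt never appears in the output, it only determines where the refinement is heading.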

Key Takeaways

Language models generate tokens one-at-a-time (autoregressive). Image models refine an entire image simultaneously (diffusion).
Image models operate on continuous values (effectively infinite possibilities), not a discrete vocabulary.
Image models work in compressed "latent space," not raw pixels.
Prompting an image model sets a destination, not a starting sequence.
