Why Models Don't Generate Pixels Directly

피파 한 줄 정리: 픽셀 1024×1024×3은 3M values야. 모델이 그걸 직접 만들면 너무 비싸고 의미도 안 잡혀. 그래서 VAE로 65K짜리 latent로 압축한 다음 거기서 작업해 — 48배 효율.

Here's an obvious question: if we want to generate images, why not just have the model output pixel values directly — red, green, blue for every dot on the screen? The answer is the same reason you don't write a novel letter-by-letter without thinking about words and sentences first: the raw representation is too low-level, too redundant, and too expensive to work with.

Let's make this concrete with numbers.

The Scale Problem

A 1024×1024 image has 1,048,576 pixels. Each pixel has 3 color channels (RGB). That's 3,145,728 individual values the model would need to generate — and they're not independent. Adjacent pixels are usually similar (that's why images look smooth, not random). There's massive redundancy.

Raw Pixel Space:                    Latent Space:

1024 × 1024 × 3 = 3,145,728       128 × 128 × 4 = 65,536 values
values to generate                  (48× smaller!)

🟥🟧🟨🟩🟦🟪🟥🟧🟨🟩...          ◼️◼️◼️◼️
🟥🟧🟨🟩🟦🟪🟥🟧🟨🟩...          ◼️◼️◼️◼️
🟥🟧🟨🟩🟦🟪🟥🟧🟨🟩...          ◼️◼️◼️◼️
    (massive, redundant)               (compact, meaningful)

Enter Latent Space

Instead of working with raw pixels, modern models use a Variational Autoencoder (VAE) to compress images into a much smaller "latent" representation. Think of it as a ZIP file for images, but smarter — it doesn't just compress data, it compresses meaning.

Here's how it works:

Encoder: Takes a full-resolution image (say 512×512×3) and compresses it to a small latent tensor (say 64×64×4). This captures the image's essential structure — shapes, colors, composition, style — while discarding redundant pixel-level detail.
Diffusion model: Operates entirely in this compact latent space. All the denoising, generation, and conditioning happen here.
Decoder: Takes the final clean latent and "unzips" it back to full-resolution pixels.

Training: Image ──▶ [Encoder] ──▶ Latent ──▶ [Decoder] ──▶ Reconstructed Image
                      📦 compress     📦 expand

Generation: Noise ──▶ [Diffusion in Latent Space] ──▶ Clean Latent ──▶ [Decoder] ──▶ Image
             🎲              🌀 denoise                    📦 expand        🖼️

Why This Is Brilliant

Working in latent space gives three huge advantages:

Speed: Operating on 65K values instead of 3M means ~48× less computation per step. This is why you can generate images in seconds instead of hours.
Meaning: Latent representations capture semantic structure (object shape, composition, style) rather than pixel-level noise. This means the diffusion model can make "meaningful" changes at each step rather than adjusting individual dots.
Quality: By separating "understanding image structure" (the VAE's job) from "generating image content" (the diffusion model's job), each component can be optimized independently.

Key Takeaways

Models work in compressed "latent space" instead of raw pixels — it's ~48× more efficient.
A VAE encoder compresses images into latents; a VAE decoder expands latents back to pixels.
Latent space captures meaningful image structure, not redundant pixel values.
This architecture (called Latent Diffusion) is the foundation of Stable Diffusion, FLUX, and most modern image generators.

The Scale Problem

Enter Latent Space

Why This Is Brilliant

External links

Exercise

Progress

댓글 0