피파 한 줄 정리: 픽셀 1024×1024×3은 3M values야. 모델이 그걸 직접 만들면 너무 비싸고 의미도 안 잡혀. 그래서 VAE로 65K짜리 latent로 압축한 다음 거기서 작업해 — 48배 효율.
Here's an obvious question: if we want to generate images, why not just have the model output pixel values directly — red, green, blue for every dot on the screen? The answer is the same reason you don't write a novel letter-by-letter without thinking about words and sentences first: the raw representation is too low-level, too redundant, and too expensive to work with.
Let's make this concrete with numbers.
The Scale Problem
A 1024×1024 image has 1,048,576 pixels. Each pixel has 3 color channels (RGB). That's 3,145,728 individual values the model would need to generate — and they're not independent. Adjacent pixels are usually similar (that's why images look smooth, not random). There's massive redundancy.
Raw Pixel Space: Latent Space:
1024 × 1024 × 3 = 3,145,728 128 × 128 × 4 = 65,536 values
values to generate (48× smaller!)
🟥🟧🟨🟩🟦🟪🟥🟧🟨🟩... ◼️◼️◼️◼️
🟥🟧🟨🟩🟦🟪🟥🟧🟨🟩... ◼️◼️◼️◼️
🟥🟧🟨🟩🟦🟪🟥🟧🟨🟩... ◼️◼️◼️◼️
(massive, redundant) (compact, meaningful)
Enter Latent Space
Instead of working with raw pixels, modern models use a Variational Autoencoder (VAE) to compress images into a much smaller "latent" representation. Think of it as a ZIP file for images, but smarter — it doesn't just compress data, it compresses meaning.
Here's how it works:
- Encoder: Takes a full-resolution image (say 512×512×3) and compresses it to a small latent tensor (say 64×64×4). This captures the image's essential structure — shapes, colors, composition, style — while discarding redundant pixel-level detail.
- Diffusion model: Operates entirely in this compact latent space. All the denoising, generation, and conditioning happen here.
- Decoder: Takes the final clean latent and "unzips" it back to full-resolution pixels.
Training: Image ──▶ [Encoder] ──▶ Latent ──▶ [Decoder] ──▶ Reconstructed Image
📦 compress 📦 expand
Generation: Noise ──▶ [Diffusion in Latent Space] ──▶ Clean Latent ──▶ [Decoder] ──▶ Image
🎲 🌀 denoise 📦 expand 🖼️
Why This Is Brilliant
Working in latent space gives three huge advantages:
- Speed: Operating on 65K values instead of 3M means ~48× less computation per step. This is why you can generate images in seconds instead of hours.
- Meaning: Latent representations capture semantic structure (object shape, composition, style) rather than pixel-level noise. This means the diffusion model can make "meaningful" changes at each step rather than adjusting individual dots.
- Quality: By separating "understanding image structure" (the VAE's job) from "generating image content" (the diffusion model's job), each component can be optimized independently.
- Models work in compressed "latent space" instead of raw pixels — it's ~48× more efficient.
- A VAE encoder compresses images into latents; a VAE decoder expands latents back to pixels.
- Latent space captures meaningful image structure, not redundant pixel values.
- This architecture (called Latent Diffusion) is the foundation of Stable Diffusion, FLUX, and most modern image generators.