C.W.K.
Stream
Lesson 01 of 10 · published

Why Models Don't Generate Pixels Directly

~12 min · diffusion, latent-space, l1

Level 0Spark
0 XP0/100 lessons0/14 achievements
0/200 XP to next level200 XP to go0% complete

피파 한 줄 정리: 픽셀 1024×1024×3은 3M values야. 모델이 그걸 직접 만들면 너무 비싸고 의미도 안 잡혀. 그래서 VAE로 65K짜리 latent로 압축한 다음 거기서 작업해 — 48배 효율.

Here's an obvious question: if we want to generate images, why not just have the model output pixel values directly — red, green, blue for every dot on the screen? The answer is the same reason you don't write a novel letter-by-letter without thinking about words and sentences first: the raw representation is too low-level, too redundant, and too expensive to work with.

Let's make this concrete with numbers.

The Scale Problem

A 1024×1024 image has 1,048,576 pixels. Each pixel has 3 color channels (RGB). That's 3,145,728 individual values the model would need to generate — and they're not independent. Adjacent pixels are usually similar (that's why images look smooth, not random). There's massive redundancy.

Raw Pixel Space:                    Latent Space:

1024 × 1024 × 3 = 3,145,728       128 × 128 × 4 = 65,536 values
values to generate                  (48× smaller!)

🟥🟧🟨🟩🟦🟪🟥🟧🟨🟩...          ◼️◼️◼️◼️
🟥🟧🟨🟩🟦🟪🟥🟧🟨🟩...          ◼️◼️◼️◼️
🟥🟧🟨🟩🟦🟪🟥🟧🟨🟩...          ◼️◼️◼️◼️
    (massive, redundant)               (compact, meaningful)

Enter Latent Space

Instead of working with raw pixels, modern models use a Variational Autoencoder (VAE) to compress images into a much smaller "latent" representation. Think of it as a ZIP file for images, but smarter — it doesn't just compress data, it compresses meaning.

Here's how it works:

  1. Encoder: Takes a full-resolution image (say 512×512×3) and compresses it to a small latent tensor (say 64×64×4). This captures the image's essential structure — shapes, colors, composition, style — while discarding redundant pixel-level detail.
  2. Diffusion model: Operates entirely in this compact latent space. All the denoising, generation, and conditioning happen here.
  3. Decoder: Takes the final clean latent and "unzips" it back to full-resolution pixels.
Training: Image ──▶ [Encoder] ──▶ Latent ──▶ [Decoder] ──▶ Reconstructed Image
                      📦 compress     📦 expand

Generation: Noise ──▶ [Diffusion in Latent Space] ──▶ Clean Latent ──▶ [Decoder] ──▶ Image
             🎲              🌀 denoise                    📦 expand        🖼️

Why This Is Brilliant

Working in latent space gives three huge advantages:

  • Speed: Operating on 65K values instead of 3M means ~48× less computation per step. This is why you can generate images in seconds instead of hours.
  • Meaning: Latent representations capture semantic structure (object shape, composition, style) rather than pixel-level noise. This means the diffusion model can make "meaningful" changes at each step rather than adjusting individual dots.
  • Quality: By separating "understanding image structure" (the VAE's job) from "generating image content" (the diffusion model's job), each component can be optimized independently.
Key Takeaways
  • Models work in compressed "latent space" instead of raw pixels — it's ~48× more efficient.
  • A VAE encoder compresses images into latents; a VAE decoder expands latents back to pixels.
  • Latent space captures meaningful image structure, not redundant pixel values.
  • This architecture (called Latent Diffusion) is the foundation of Stable Diffusion, FLUX, and most modern image generators.

External links

Exercise

SD 1.5·SDXL·FLUX의 VAE 압축비 검색. 각각 raw 1024×1024 RGB 대비 얼마나 적은 값을 다루는지 계산. 100자로 inference 속도에 미치는 영향 정리.

Progress

Progress is local-only — sign in to sync across devices.
이 페이지에서 버그를 발견하셨거나 피드백이 있으세요?문제 신고

댓글 0

🔔 답글 알림 (로그인 필요)
로그인댓글을 남기려면 로그인해 주세요.

아직 댓글이 없어요. 첫 댓글을 남겨보세요.