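Pippa's one-line summary: The field is moving from diffusion to flow matching. FLUX and SD 3.5 are both flow-matching based. From a user's point of view there is almost no difference, but flow matching is more efficient and more stable.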
Diffusion isn't the only way to generate images — it's just the one that currently dominates. Let's zoom out and see the landscape of approaches, so you understand why diffusion won and what alternatives exist.
Think of three different ways to create a painting:
- Diffusion: Start with a canvas of random splatter, then carefully clean and refine the whole canvas simultaneously until a painting emerges. (Sculptor approach)
- Autoregressive: Paint one pixel (or patch) at a time, left-to-right, top-to-bottom, each one informed by all previous pixels. (Typewriter approach)
- Flow-based: Smoothly morph a random blob into the final painting through a continuous, learned transformation — like a time-lapse of a painting appearing in one smooth motion. (Morphing approach)
Diffusion Models
How: Learn to reverse a noise-adding process. Generate by iteratively denoising from random noise.
Examples: Stable Diffusion 1.5/XL, DALL-E 3, Imagen 3
Strengths: High quality, good diversity, well-understood training, strong ecosystem of tools (ControlNet, LoRA, inpainting)
Weaknesses: Slow (each image needs many sequential denoising steps), and the achievable quality depends heavily on the backbone architecture (U-Net vs. transformer)
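To make "reverse a noise-adding process" concrete, here is a minimal sketch in plain Python. It assumes a toy setup: a three-number "image" (`TARGET`), a linear noise schedule, and an `oracle_eps` function that stands in for the trained network by computing the exact noise for a dataset containing only `TARGET`. All of these names and values are illustrative assumptions, and the update rule is a deterministic DDIM-style step rather than any particular library's sampler.

```python
import math
import random

T = 50  # number of noise levels (toy value, an assumption)

# Linear beta schedule and its cumulative product alpha_bar.
betas = [1e-4 + (0.2 - 1e-4) * i / (T - 1) for i in range(T)]
alpha_bar = []
prod = 1.0
for b in betas:
    prod *= 1.0 - b
    alpha_bar.append(prod)

TARGET = [0.8, -0.3, 0.5]  # hypothetical three-value "image"


def add_noise(x0, t, eps):
    """Forward process: blend the clean sample with Gaussian noise."""
    a = alpha_bar[t]
    return [math.sqrt(a) * x + math.sqrt(1 - a) * e for x, e in zip(x0, eps)]


def oracle_eps(x_t, t):
    """Stand-in for a trained noise predictor (exact only for this toy data)."""
    a = alpha_bar[t]
    return [(x - math.sqrt(a) * x0) / math.sqrt(1 - a)
            for x, x0 in zip(x_t, TARGET)]


def denoise_step(x_t, t):
    """One deterministic (DDIM-style) step from noise level t to t - 1."""
    a_t = alpha_bar[t]
    a_prev = alpha_bar[t - 1] if t > 0 else 1.0
    eps = oracle_eps(x_t, t)
    # Estimate the clean sample, then re-noise it to the lower level.
    x0_pred = [(x - math.sqrt(1 - a_t) * e) / math.sqrt(a_t)
               for x, e in zip(x_t, eps)]
    return [math.sqrt(a_prev) * x0 + math.sqrt(1 - a_prev) * e
            for x0, e in zip(x0_pred, eps)]


def sample(seed=0):
    """Start from pure noise and denoise through every level."""
    rng = random.Random(seed)
    x = [rng.gauss(0, 1) for _ in TARGET]
    for t in reversed(range(T)):
        x = denoise_step(x, t)
    return x
```

Calling `sample()` walks from random noise back to `TARGET`. In a real model the oracle is replaced by a large neural network that has to generalize across a whole dataset, which is where the many sequential steps and the cost come from.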
Autoregressive Models
How: Predict the next image token (or patch) based on all previous ones, similar to how language models predict the next word.
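Examples: DALL-E 1, Parti, and some newer multimodal models that emit image tokens alongside text. DALL-E 2 fits only partially: one variant of its prior is autoregressive, but its image decoder is a diffusion model.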
Strengths: Natural fit for combined text+image generation, can leverage scaling insights from language models
Weaknesses: Sequential generation is slow, can accumulate errors, historically lower image quality than diffusion
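The "typewriter" loop can be sketched in a few lines of plain Python. This is a toy under stated assumptions: the image is a 4x4 grid of discrete tokens, and `next_token_probs` is a hypothetical stand-in for a trained transformer (it simply favors repeating the previous token). The grid size, vocabulary size, and function names are all made up for illustration.

```python
import random

VOCAB = 4    # number of distinct image tokens (toy value)
H, W = 4, 4  # token grid size (toy value)


def next_token_probs(prefix):
    """Hypothetical stand-in for a trained model: P(next token | prefix)."""
    if not prefix:
        return [1.0 / VOCAB] * VOCAB
    probs = [0.1] * VOCAB
    probs[prefix[-1]] = 0.7  # favor repeating the previous token
    return probs


def generate(seed=0):
    """Sample tokens one at a time in raster order, then reshape to a grid."""
    rng = random.Random(seed)
    tokens = []
    for _ in range(H * W):
        probs = next_token_probs(tokens)  # conditioned on everything so far
        tok = rng.choices(range(VOCAB), weights=probs, k=1)[0]
        tokens.append(tok)
    return [tokens[r * W:(r + 1) * W] for r in range(H)]
```

The loop makes both weaknesses visible: each token has to wait for the previous one, so generation cannot be parallelized across positions, and an unlucky early sample is baked into every later prediction.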
Flow Matching / Flow-Based Models
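How: Learn a continuous transformation (flow) from a simple noise distribution to the image distribution. Instead of learning to undo a fixed noising schedule step by step, the model learns a velocity field that describes a smooth, direct path from noise to image. Sampling follows that path with an ODE solver, which still takes discrete steps but usually needs fewer of them.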
Examples: FLUX, Stable Diffusion 3.5 (an MMDiT architecture trained with a rectified-flow objective, a form of flow matching)
Strengths: More efficient sampling, cleaner theoretical foundation, better training stability, can produce high-quality results in fewer steps
Weaknesses: Newer approach, some techniques from the diffusion ecosystem (like standard CFG) needed adaptation
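Here is a minimal flow-matching sketch in plain Python, again as a toy under stated assumptions. It uses the straight-line path x_t = (1 - t) * noise + t * data with t = 0 as noise and t = 1 as data (some papers flip this convention), a three-number `TARGET` as the only "image", and an `oracle_velocity` function standing in for the trained network. None of these names come from FLUX or Stable Diffusion code.

```python
import random

TARGET = [0.8, -0.3, 0.5]  # hypothetical three-value "image"


def training_pair(noise, data, t):
    """Point on the straight noise-to-data path and its target velocity."""
    x_t = [(1 - t) * n + t * d for n, d in zip(noise, data)]
    v_target = [d - n for n, d in zip(noise, data)]  # constant along the path
    return x_t, v_target


def oracle_velocity(x, t):
    """Stand-in for a trained velocity network (exact only for this toy data)."""
    return [(d - xi) / (1.0 - t) for xi, d in zip(x, TARGET)]


def sample(steps, seed=0):
    """Integrate dx/dt = v(x, t) from t = 0 to t = 1 with Euler steps."""
    rng = random.Random(seed)
    x = [rng.gauss(0, 1) for _ in TARGET]
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt
        v = oracle_velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x
```

Because the path in this toy is perfectly straight, `sample(1)` and `sample(4)` both land on `TARGET`. A learned velocity field over real images is only approximately straight, so practical samplers still use several steps, just typically fewer than a classic diffusion sampler.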
Why Diffusion (and Flow Matching) Won
The short answer: quality and scalability. Diffusion-family approaches produce the best images, scale well with more compute and data, and have a rich ecosystem of control tools. Autoregressive approaches had their moment but couldn't match diffusion quality for standalone image generation (though they're making a comeback in multimodal/video contexts).
Image Generation Approach Timeline:
2021-2022: GANs dominant → Diffusion overtakes
2022-2023: Diffusion (U-Net) dominates → Stable Diffusion era
2023-2024: Diffusion Transformers (DiT) emerge → DALL-E 3, SD 3
2024-2025: Flow matching + DiT → FLUX, SD 3.5
2025-2026: Flow matching matures → FLUX.2, continued evolution
- Diffusion (noise→image through denoising), autoregressive (one token or patch at a time), and flow matching (smooth transformation) are the three main approaches.
- Diffusion dominated 2022-2024; flow matching (its successor) is taking over in 2024-2026.
- FLUX and SD 3.5 use flow matching. Think of it as "diffusion 2.0": more efficient sampling and more stable training.
- For users, the practical behavior is similar across diffusion and flow matching — the differences matter more for researchers.