Diffusion vs. Autoregressive vs. Flow-Based Approaches

Pippa's one-line summary: The field is moving from diffusion to flow matching. FLUX and SD 3.5 are built on flow matching. As a user you'll notice almost no difference, but it's more efficient and more stable.

Diffusion isn't the only way to generate images — it's just the one that currently dominates. Let's zoom out and see the landscape of approaches, so you understand why diffusion won and what alternatives exist.

Think of three different ways to create a painting:

Diffusion: Start with a canvas of random splatter, then carefully clean and refine the whole canvas simultaneously until a painting emerges. (Sculptor approach)
Autoregressive: Paint one pixel (or patch) at a time, left-to-right, top-to-bottom, each one informed by all previous pixels. (Typewriter approach)
Flow-based: Smoothly morph a random blob into the final painting through a continuous, learned transformation — like a time-lapse of a painting appearing in one smooth motion. (Morphing approach)

Diffusion Models

How: Learn to reverse a noise-adding process. Generate by iteratively denoising from random noise.

Examples: Stable Diffusion 1.5/XL, DALL-E 3, Imagen 3

Strengths: High quality, good diversity, well-understood training, strong ecosystem of tools (ControlNet, LoRA, inpainting)

Weaknesses: Slow sampling (many denoising steps, each one a full pass through the network); the backbone architecture (U-Net vs. transformer) caps how good the results can get
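
To make the "reverse a noise-adding process" idea concrete, here is a minimal 1-D sketch in plain Python. It is a toy under loud assumptions, not how any real model is implemented: the "image" is a single number, the schedule values are arbitrary, and `oracle_noise_prediction` is a hypothetical stand-in for the trained network (it cheats by knowing that all the training data is one point). The update inside the loop follows the standard DDPM-style reverse step.

```python
import math
import random


def make_schedule(num_steps=50, beta_start=1e-4, beta_end=0.2):
    """Linear noise schedule (values chosen for this toy, not from any real model)."""
    betas = [beta_start + (beta_end - beta_start) * i / (num_steps - 1)
             for i in range(num_steps)]
    alphas = [1.0 - b for b in betas]
    alpha_bars, running = [], 1.0
    for a in alphas:
        running *= a  # cumulative product: how much signal survives up to step t
        alpha_bars.append(running)
    return betas, alphas, alpha_bars


def add_noise(x0, t, alpha_bars, rng):
    """Forward process: jump straight from clean x0 to its noisy version at step t."""
    eps = rng.gauss(0.0, 1.0)
    x_t = math.sqrt(alpha_bars[t]) * x0 + math.sqrt(1.0 - alpha_bars[t]) * eps
    return x_t, eps


def oracle_noise_prediction(x_t, t, alpha_bars, data_point):
    """Hypothetical stand-in for the trained denoiser.

    Assumes every training example equals data_point, so the noise in x_t
    can be solved for exactly. A real model has to learn this from data.
    """
    return (x_t - math.sqrt(alpha_bars[t]) * data_point) / math.sqrt(1.0 - alpha_bars[t])


def ddpm_sample(data_point, num_steps=50, seed=0):
    """Reverse process: start from pure noise and denoise one step at a time."""
    rng = random.Random(seed)
    betas, alphas, alpha_bars = make_schedule(num_steps)
    x = rng.gauss(0.0, 1.0)  # the "canvas of random splatter"
    for t in reversed(range(num_steps)):
        eps_hat = oracle_noise_prediction(x, t, alpha_bars, data_point)
        # Remove the predicted noise for this step.
        mean = (x - betas[t] / math.sqrt(1.0 - alpha_bars[t]) * eps_hat) / math.sqrt(alphas[t])
        # Re-inject a little fresh noise on every step except the last.
        fresh = rng.gauss(0.0, 1.0) if t > 0 else 0.0
        x = mean + math.sqrt(betas[t]) * fresh
    return x
```

Calling `ddpm_sample(3.0)` walks a random starting value back to the single "training image" at 3.0. Notice that the loop calls the denoiser once per step, which is exactly where the "slow" weakness comes from.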

Autoregressive Models

How: Predict the next image token (or patch) based on all previous ones, similar to how language models predict the next word.

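Examples: DALL-E 1, Parti, some aspects of newer multimodal models. DALL-E 2 is only partly autoregressive: its prior could be autoregressive, but its image decoder is a diffusion model.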

Strengths: Natural fit for combined text+image generation, can leverage scaling insights from language models

Weaknesses: Sequential generation is slow, can accumulate errors, historically lower image quality than diffusion
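
The "predict the next image token" loop described above can be sketched in a few lines of plain Python. Everything here is illustrative: `VOCAB_SIZE`, `toy_next_token_probs`, and `generate_tokens` are made-up names, and the toy probability rule merely stands in for a trained transformer. In a real system a separate decoder would also turn the token grid back into pixels, which this sketch leaves out.

```python
import random

VOCAB_SIZE = 4  # hypothetical size of the image-token codebook


def toy_next_token_probs(prefix):
    """Stand-in for a trained model: a distribution over the next token.

    This toy rule just favors repeating the previous token. A real model
    would condition on the full prefix (and on the text prompt).
    """
    if not prefix:
        return [1.0 / VOCAB_SIZE] * VOCAB_SIZE
    probs = [0.1] * VOCAB_SIZE
    probs[prefix[-1]] = 0.7
    return probs


def generate_tokens(height, width, seed=0):
    """Sample image tokens one at a time in raster order (left-to-right, top-to-bottom)."""
    rng = random.Random(seed)
    tokens = []
    for _ in range(height * width):
        probs = toy_next_token_probs(tokens)  # depends on everything generated so far
        next_token = rng.choices(range(VOCAB_SIZE), weights=probs, k=1)[0]
        tokens.append(next_token)
    # Reshape the flat sequence into a height x width grid of tokens.
    return [tokens[r * width:(r + 1) * width] for r in range(height)]
```

`generate_tokens(4, 6)` returns a 4x6 grid of token ids. The loop body runs once per token and each call needs the output of the previous one, so the work cannot be parallelized across positions. That is the "typewriter" cost in code form, and it is also why an early bad token can influence everything after it.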

Flow Matching / Flow-Based Models

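How: Learn a continuous transformation (flow) from a simple noise distribution to the image distribution. Instead of learning to undo a fixed noising schedule step by step, the model learns a velocity field that carries noise to an image along a smooth, ideally near-straight path. Sampling still takes discrete solver steps, but a straighter path needs fewer of them.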

Examples: FLUX, Stable Diffusion 3.5 (MMDiT is the architecture; the training objective is rectified flow, a form of flow matching)

Strengths: More efficient sampling, cleaner theoretical foundation, better training stability, can produce high-quality results in fewer steps

Weaknesses: Newer approach, some techniques from the diffusion ecosystem (like standard CFG) needed adaptation
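
Here is a minimal 1-D sketch of the flow matching idea, again in plain Python and again a toy under stated assumptions. It uses the common straight-line path between a noise sample and a data sample, with the convention (an assumption here, since papers differ) that t=0 is noise and t=1 is data. `oracle_velocity` is a hypothetical stand-in for the trained network that cheats by knowing all the data sits at one point, and `euler_sample` is a plain Euler integrator, not any library's scheduler.

```python
import random


def interpolate(noise, data, t):
    """Point on the straight line from noise (t=0) to data (t=1)."""
    return (1.0 - t) * noise + t * data


def target_velocity(noise, data):
    """Training target: the constant velocity along that straight line."""
    return data - noise


def oracle_velocity(x, t, data_point):
    """Hypothetical stand-in for the trained velocity network.

    Assumes every training example equals data_point, in which case the
    ideal velocity at (x, t) points straight at it. Requires t < 1.
    """
    return (data_point - x) / (1.0 - t)


def euler_sample(data_point, num_steps=4, seed=0):
    """Generate by integrating dx/dt = v(x, t) from t=0 to t=1 with Euler steps."""
    rng = random.Random(seed)
    x = rng.gauss(0.0, 1.0)  # start from noise
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        x = x + dt * oracle_velocity(x, t, data_point)
    return x
```

Because the path in this toy is perfectly straight, `euler_sample(2.0, num_steps=1)` lands on 2.0 just as well as `num_steps=50` does. Real image distributions give curved paths, so real samplers still need several steps, but this is the intuition behind "high-quality results in fewer steps".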

Why Diffusion (and Flow Matching) Won

The short answer: quality and scalability. Diffusion-family approaches produce the best images, scale well with more compute and data, and have a rich ecosystem of control tools. Autoregressive approaches had their moment but couldn't match diffusion quality for standalone image generation (though they're making a comeback in multimodal/video contexts).

Image Generation Approach Timeline:

2021-2022: GANs dominant → Diffusion overtakes
2022-2023: Diffusion (U-Net) dominates → Stable Diffusion era
2023-2024: Diffusion Transformers (DiT) emerge → DALL-E 3, SD 3
2024-2025: Flow matching + DiT → FLUX, SD 3.5
2025-2026: Flow matching matures → FLUX.2, continued evolution

Key Takeaways

Diffusion (noise→image through denoising), autoregressive (one token or patch at a time), and flow matching (smooth transformation) are the three main approaches.
Diffusion dominated 2022-2024; flow matching (its successor) is taking over in 2024-2026.
FLUX and SD 3.5 use flow matching — think of it as "diffusion 2.0" with faster, more stable generation.
For users, the practical behavior is similar across diffusion and flow matching — the differences matter more for researchers.
