Diffusion Intuition: From Noise to Image

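Pippa's one-line summary: The core of diffusion in one line: **once a model learns to gradually *remove* noise, it can use that skill to start from noise and *generate* an image**. Anchor this with the sculptor analogy.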

Here's the most beautiful idea in modern image generation, and it's surprisingly simple. Imagine you have a photograph. Now imagine slowly adding TV static (random noise) to it, one layer at a time. After enough layers, the photograph is completely destroyed — nothing but pure static remains. Diffusion models learn to run this process in reverse: starting from pure noise, they gradually remove the static, step by step, until a coherent image emerges.

Forward (Training): Add noise progressively

  🖼️ ──▶ 🖼️+🌫️ ──▶ 🌫️+🌫️ ──▶ 🌫️🌫️🌫️ ──▶ 🎲 Pure noise
  Clean     Slightly     Mostly      Very         Total
  image     noisy        noisy       noisy        chaos
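The forward process above can be sketched in a few lines. This is a minimal illustration, not any particular library's implementation: it assumes a linear noise schedule over 1000 steps (a common choice, but an assumption here), and the names `linear_alpha_bars`, `add_noise`, and the 4-pixel toy `image` are made up for the example.

```python
import math
import random

def linear_alpha_bars(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Fraction of the original signal (squared) left after each step.

    Assumes a linear beta schedule; the default values are illustrative.
    """
    alpha_bars = []
    running = 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / (num_steps - 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def add_noise(image, alpha_bar, rng):
    """Mix image and Gaussian noise: sqrt(ab) * image + sqrt(1 - ab) * noise."""
    signal = math.sqrt(alpha_bar)
    noise = math.sqrt(1.0 - alpha_bar)
    return [signal * px + noise * rng.gauss(0.0, 1.0) for px in image]

alpha_bars = linear_alpha_bars()
image = [1.0, -1.0, 0.5, -0.5]  # toy 4-pixel "photograph" (hypothetical data)
rng = random.Random(0)

# Print the signal weight and the noisy pixels at a few points in the schedule.
for t in (0, 250, 500, 999):
    noisy = add_noise(image, alpha_bars[t], rng)
    print(t, round(math.sqrt(alpha_bars[t]), 3), [round(v, 2) for v in noisy])
```

The signal weight shrinks toward zero as `t` grows, which is the "clean image to total chaos" arrow in the diagram.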

Reverse (Generation): Remove noise progressively

  🎲 ──▶ 🌫️🌫️ ──▶ 🌫️+🌁 ──▶ 🌁+🖼️ ──▶ 🖼️ Final image
  Pure     Still       Shapes     Details    Clean,
  noise    chaotic     emerge     sharpen    coherent

That's it. That's the core insight. The model has learned what noise looks like at every stage, and it has been trained to predict what the slightly-less-noisy version should look like (in practice, many diffusion models do this by predicting the noise that was added, which can then be subtracted). By applying this prediction repeatedly, it peels away layers of noise to reveal an image.
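To make "apply the prediction repeatedly" concrete, here is a toy reverse loop. It is a sketch under strong assumptions: the dataset contains a single 4-pixel image (`TARGET`, hypothetical), so the "model" can be replaced by an exact formula (`oracle_noise`) instead of a trained network. The update rule is the standard noise-prediction sampling step with a linear 1000-step schedule, which is assumed here rather than taken from this lesson.

```python
import math
import random

T = 1000
betas = [1e-4 + (0.02 - 1e-4) * t / (T - 1) for t in range(T)]
alpha_bars = []
running = 1.0
for beta in betas:
    running *= 1.0 - beta
    alpha_bars.append(running)

TARGET = [1.0, -1.0, 0.5, -0.5]  # the only "image" in this toy dataset

def oracle_noise(x, t):
    """Stand-in for a trained network: exact noise when the data is always TARGET."""
    s = math.sqrt(alpha_bars[t])
    n = math.sqrt(1.0 - alpha_bars[t])
    return [(xi - s * ci) / n for xi, ci in zip(x, TARGET)]

def sample(rng):
    x = [rng.gauss(0.0, 1.0) for _ in TARGET]  # start from pure noise
    for t in reversed(range(T)):
        eps = oracle_noise(x, t)
        scale = betas[t] / math.sqrt(1.0 - alpha_bars[t])
        # Subtract the predicted noise to get a slightly cleaner estimate.
        mean = [(xi - scale * ei) / math.sqrt(1.0 - betas[t])
                for xi, ei in zip(x, eps)]
        # Add a little fresh noise at every step except the last one.
        sigma = math.sqrt(betas[t]) if t > 0 else 0.0
        x = [m + sigma * rng.gauss(0.0, 1.0) for m in mean]
    return x

result = sample(random.Random(0))
print([round(v, 3) for v in result])  # ends at (approximately) TARGET
```

A real model replaces `oracle_noise` with a neural network trained on many images; the loop around it keeps the same shape.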

Why Starting From Noise Works

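This seems backward: why would you start with garbage? Because pure noise is easy to sample and carries no preexisting structure or bias toward any particular image. It is also exactly where the forward process ends, so it is the starting point the model was trained to work back from. Starting from noise means the model is free to create any image it has learned to produce, and different noise samples lead to different images. The noise is like a blank canvas made of randomness, and the denoising process is the painting.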
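A tiny experiment shows how the starting noise picks the outcome. This sketch assumes a one-pixel "dataset" whose images are either -1 or +1 with equal probability; for that toy case the best possible noise prediction has a closed form (`ideal_noise_prediction`), so no training is needed. The function names, the schedule, and the seeds are all illustrative assumptions.

```python
import math
import random

T = 1000
betas = [1e-4 + (0.02 - 1e-4) * t / (T - 1) for t in range(T)]
alpha_bars = []
running = 1.0
for beta in betas:
    running *= 1.0 - beta
    alpha_bars.append(running)

def ideal_noise_prediction(x, t):
    """Optimal noise estimate when clean data is -1 or +1 with equal probability."""
    ab = alpha_bars[t]
    # Expected clean value given the noisy value x at this noise level.
    x0_hat = math.tanh(math.sqrt(ab) * x / (1.0 - ab))
    return (x - math.sqrt(ab) * x0_hat) / math.sqrt(1.0 - ab)

def generate(seed):
    rng = random.Random(seed)
    x = rng.gauss(0.0, 1.0)  # the seed decides the starting noise
    for t in reversed(range(T)):
        eps = ideal_noise_prediction(x, t)
        scale = betas[t] / math.sqrt(1.0 - alpha_bars[t])
        mean = (x - scale * eps) / math.sqrt(1.0 - betas[t])
        sigma = math.sqrt(betas[t]) if t > 0 else 0.0
        x = mean + sigma * rng.gauss(0.0, 1.0)
    return x

samples = [generate(seed) for seed in range(20)]
print([round(s) for s in samples])  # each entry is -1 or 1
```

Every run ends on one of the two "images", and rerunning with the same seed reproduces the same result. Which of the two a run lands on depends only on the random numbers it drew.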

The Sculptor Analogy

Think of it like a sculptor working with marble. The sculptor doesn't add the statue: the statue is "already inside" the marble block, and the sculptor removes material to reveal it. Similarly, the image is "already inside" the noise (loosely speaking; which image emerges depends on the noise sample and on what the model has learned), and the diffusion model removes noise to reveal it.

Here's what each phase looks like conceptually:

Early steps (high noise): The model makes big decisions — overall composition, rough layout, major shapes, color scheme. Like a sculptor blocking out the general form.
Middle steps (medium noise): Structure solidifies — faces take shape, objects become recognizable, spatial relationships lock in. The sculptor defines limbs, posture, proportions.
Late steps (low noise): Fine details emerge — skin texture, fabric weave, lighting subtleties, sharp edges. The sculptor polishes and adds detail.

Step 1-5:     🟫🟫🟫  "Big decisions": layout, composition, major shapes
Step 6-15:    🗿       "Structure": objects form, faces emerge
Step 16-30:   🏛️       "Refinement": details, textures, lighting
Step 31-50:   🖼️       "Polish": final crisp details
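One way to see why early steps handle the big picture and late steps handle detail is to look at how much noise remains at each sampling step. The sketch below assumes a 50-step sampler that walks evenly through a 1000-step linear schedule. Both the stride and the schedule are assumptions for illustration, and real samplers space their steps in various ways.

```python
import math

T = 1000
betas = [1e-4 + (0.02 - 1e-4) * t / (T - 1) for t in range(T)]
alpha_bars = []
running = 1.0
for beta in betas:
    running *= 1.0 - beta
    alpha_bars.append(running)

NUM_SAMPLING_STEPS = 50
stride = T // NUM_SAMPLING_STEPS
# Sampling step 1 visits the noisiest timestep; step 50 visits the cleanest.
timesteps = list(range(T - 1, -1, -stride))

def noise_level(t):
    """Weight of the noise in the mix at timestep t: sqrt(1 - alpha_bar)."""
    return math.sqrt(1.0 - alpha_bars[t])

for step in (1, 5, 15, 30, 50):
    t = timesteps[step - 1]
    print(f"sampling step {step:2d} -> timestep {t:3d}, "
          f"noise level {noise_level(t):.3f}")
```

While the noise level is high, only coarse structure can be told apart from static, so that is all the model can commit to. Fine texture only becomes distinguishable once the noise level is low.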

Key Takeaways

Diffusion = start with noise, gradually remove it to reveal an image.
The model learned to reverse a noise-adding process: predict "what does a slightly cleaner version look like?"
Early steps decide composition and layout; late steps add fine detail.
Starting from noise gives the model maximum creative freedom: each noise sample can become a different image.
