Pippa's one-line summary: The core of diffusion in one line: **if you learn to gradually *remove* noise, you can use that skill to start from noise and *generate* an image**. Hold onto it with the sculptor analogy.
Here's the most beautiful idea in modern image generation, and it's surprisingly simple. Imagine you have a photograph. Now imagine slowly adding TV static (random noise) to it, one layer at a time. After enough layers, the photograph is completely destroyed — nothing but pure static remains. Diffusion models learn to run this process in reverse: starting from pure noise, they gradually remove the static, step by step, until a coherent image emerges.
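The noise-adding half of this picture is simple enough to sketch in a few lines. What follows is a minimal illustration, not any particular library's implementation: it assumes the common DDPM-style formulation in which the noisy image is a weighted mix of the original and Gaussian noise, and it assumes a linear schedule of 1000 steps. The names `make_alpha_bars` and `add_noise`, the schedule values, and the 4-pixel toy "photograph" are all made up for the example.

```python
import math
import random


def make_alpha_bars(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Fraction of the original signal left after each step.

    Assumes a linear noise schedule; the values are illustrative defaults.
    """
    alpha_bars = []
    running = 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / (num_steps - 1)
        running *= 1.0 - beta          # each step keeps (1 - beta) of the signal
        alpha_bars.append(running)
    return alpha_bars


def add_noise(pixels, alpha_bar, rng):
    """Jump straight to one noise level: sqrt(a) * image + sqrt(1 - a) * static."""
    signal_weight = math.sqrt(alpha_bar)
    noise_weight = math.sqrt(1.0 - alpha_bar)
    return [signal_weight * p + noise_weight * rng.gauss(0.0, 1.0) for p in pixels]


rng = random.Random(0)
image = [1.0, -1.0, 0.5, -0.5]         # toy 4-pixel "photograph", values in [-1, 1]
alpha_bars = make_alpha_bars()

for t in (0, 250, 500, 999):
    noisy = add_noise(image, alpha_bars[t], rng)
    print(t, round(alpha_bars[t], 4), [round(v, 2) for v in noisy])
```

As `t` grows, the signal fraction shrinks toward zero and the printed pixels stop resembling the original values: the photograph dissolving into static.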
Forward (Training): Add noise progressively

🖼️ ──▶ 🖼️+🌫️ ──▶ 🌫️+🌫️ ──▶ 🌫️🌫️🌫️ ──▶ 🎲
Clean image ──▶ Slightly noisy ──▶ Mostly noisy ──▶ Very noisy ──▶ Total chaos (pure noise)

Reverse (Generation): Remove noise progressively

🎲 ──▶ 🌫️🌫️ ──▶ 🌫️+🌁 ──▶ 🌁+🖼️ ──▶ 🖼️
Pure noise ──▶ Still chaotic ──▶ Shapes emerge ──▶ Details sharpen ──▶ Clean, coherent final image
That's it. That's the core insight. The model is trained on images at every noise level to predict what a slightly less noisy version should look like (in practice, usually by predicting the noise that was added). Applied repeatedly, that prediction peels away layers of noise to reveal an image.
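The "apply the prediction repeatedly" loop can be sketched too, with one big caveat: there is no trained network here. The stand-in `oracle_predict_noise` cheats by already knowing the target image, so this shows only the shape of the generation loop, not how a real model learns. The cosine-shaped schedule, the 50-step count, and the deterministic (DDIM-style) update are assumptions chosen for illustration.

```python
import math
import random

NUM_STEPS = 50


def alpha_bar(step):
    """Signal fraction at `step` (0 = clean, NUM_STEPS = almost pure noise).

    Hypothetical cosine-shaped schedule, floored so we never divide by zero.
    """
    return max(math.cos(0.5 * math.pi * step / NUM_STEPS) ** 2, 1e-4)


def oracle_predict_noise(noisy, step, target):
    """Stand-in for the trained network: returns the noise hidden in `noisy`.

    It cheats by knowing `target`; a real model must learn this from data.
    """
    a = alpha_bar(step)
    return [(x - math.sqrt(a) * t) / math.sqrt(1.0 - a) for x, t in zip(noisy, target)]


def denoise_step(noisy, step, predicted_noise):
    """One deterministic (DDIM-style) update from `step` to `step - 1`."""
    a_now, a_prev = alpha_bar(step), alpha_bar(step - 1)
    result = []
    for x, eps in zip(noisy, predicted_noise):
        # Current best guess at the clean pixel, given the predicted noise.
        clean_guess = (x - math.sqrt(1.0 - a_now) * eps) / math.sqrt(a_now)
        # Re-mix that guess with a little less noise than before.
        result.append(math.sqrt(a_prev) * clean_guess + math.sqrt(1.0 - a_prev) * eps)
    return result


def generate(target, rng):
    x = [rng.gauss(0.0, 1.0) for _ in target]      # start from pure noise
    for step in range(NUM_STEPS, 0, -1):           # walk the noise level down
        eps = oracle_predict_noise(x, step, target)
        x = denoise_step(x, step, eps)
    return x


target = [1.0, -1.0, 0.5, -0.5]
result = generate(target, random.Random(0))
print([round(v, 3) for v in result])
```

Because the oracle is perfect, the loop lands on the target pixels up to floating-point error. Swap in a learned predictor and the same loop becomes a sampler: the structure stays, only the source of the noise estimate changes.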
Why Starting From Noise Works
This seems backward: why would you start with garbage? Because pure noise carries no bias, no preexisting structure, and no constraints, and it is trivial to sample. The forward process turns every training image into the same kind of static, so any fresh draw of static is a valid starting point, and nothing in it commits the model to a particular picture. The noise is like a blank canvas made of randomness, and the denoising process is the painting.
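One way to convince yourself that noise really carries no trace of a particular image: push two very different images through heavy noising and compare what comes out. This is a small sketch under assumptions, with a hypothetical helper `noise_stats`, two flat toy images, and a remaining signal fraction of 0.0001 picked for illustration.

```python
import math
import random
import statistics


def noise_stats(pixels, signal_fraction, rng):
    """Noise an image heavily and return (mean, standard deviation) of the result."""
    signal_weight = math.sqrt(signal_fraction)
    noise_weight = math.sqrt(1.0 - signal_fraction)
    noisy = [signal_weight * p + noise_weight * rng.gauss(0.0, 1.0) for p in pixels]
    return statistics.fmean(noisy), statistics.pstdev(noisy)


rng = random.Random(42)
bright = [0.9] * 10_000     # toy "all bright" image
dark = [-0.9] * 10_000      # toy "all dark" image

for name, image in (("bright", bright), ("dark", dark)):
    mean, std = noise_stats(image, 1e-4, rng)
    print(f"{name:6s} mean={mean:+.2f} std={std:.2f}")
```

Both images end up with a mean near 0 and a standard deviation near 1, which is to say they are statistically indistinguishable from plain Gaussian static. That is why a fresh random draw works as a starting point.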
The Sculptor Analogy
Think of it like a sculptor working with marble. The sculptor doesn't add the statue: the statue is "already inside" the marble block, and the sculptor removes material to reveal it. Similarly, the image is "already inside" the noise (loosely speaking: every noise sample is a valid starting point, and different samples lead to different images), and the diffusion model removes noise to reveal it.
Here's what each phase looks like conceptually:
- Early steps (high noise): The model makes big decisions — overall composition, rough layout, major shapes, color scheme. Like a sculptor blocking out the general form.
- Middle steps (medium noise): Structure solidifies — faces take shape, objects become recognizable, spatial relationships lock in. The sculptor defines limbs, posture, proportions.
- Late steps (low noise): Fine details emerge — skin texture, fabric weave, lighting subtleties, sharp edges. The sculptor polishes and adds detail.
In a 50-step run, that maps roughly to:

Step 1-5: 🟫🟫🟫 "Big decisions": layout, composition, major shapes
Step 6-15: 🗿 "Structure": objects form, faces emerge
Step 16-30: 🏛️ "Refinement": details, textures, lighting
Step 31-50: 🖼️ "Polish": final crisp details
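To put rough numbers on those phases, here is a sketch that prints how much noise is left at the start and end of each one. Everything numeric is an assumption: a 50-step run, a cosine-shaped schedule, and phase boundaries taken from the breakdown above, with the last phase starting at step 31 so the ranges don't overlap. Real samplers use different schedules, so treat the output as a feel for the shape, not as reference values.

```python
import math

TOTAL_STEPS = 50
PHASES = [
    ("Big decisions", 1, 5),
    ("Structure", 6, 15),
    ("Refinement", 16, 30),
    ("Polish", 31, 50),
]


def noise_std_after(step):
    """Noise standard deviation left after `step` denoising steps (0 = none done yet).

    Assumes a cosine-shaped schedule; this is illustrative only.
    """
    remaining = (TOTAL_STEPS - step) / TOTAL_STEPS
    signal_fraction = math.cos(0.5 * math.pi * remaining) ** 2
    return math.sqrt(max(0.0, 1.0 - signal_fraction))


for name, first, last in PHASES:
    before = noise_std_after(first - 1)
    after = noise_std_after(last)
    print(f"{name:13s} steps {first:2d}-{last:2d}: noise {before:.2f} -> {after:.2f}")
```

Under this schedule the image is still mostly static through the early steps, which fits the idea that those steps can only settle coarse things like layout, while the last stretch works on an image that is already nearly clean.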
Key Takeaways
- Diffusion = start with noise, gradually remove it to reveal an image.
- The model learned to reverse a noise-adding process: predict "what does a slightly cleaner version look like?"
- Early steps decide composition and layout; late steps add fine detail.
- Starting from noise imposes no structure up front, so the model is free to take each random sample toward a different image.