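Pippa's one-line summary: Training = one thing only, "removing noise." Generation = the emergent result of applying that ability over and over, starting from noise. Text nudges the trajectory a little at every step through cross-attention.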
You might have a nagging question: if the model only learned to remove noise from existing images, how does it create new images that it's never seen? The answer is one of the most beautiful ideas in modern AI.
Think of it this way: imagine you've studied thousands of jigsaw puzzles — you've seen every type of piece, every type of connection, every type of picture. Now someone gives you a pile of random shapes (noise) and asks you to "complete the puzzle." You can't recreate any specific puzzle from memory, but you know what a valid puzzle looks like, so you can arrange the pieces into a new, plausible picture. The model does the same thing: it knows what valid images look like (from training), and it arranges the noise into a new, plausible configuration.
The Key Insight: The Model Learned the Distribution
During training, the model isn't asked to memorize individual images. It learns an approximation of the statistical distribution of its training images: the space of "what's plausible." When it denoises from pure noise, it's navigating through this learned distribution, finding a path from chaos to a plausible image. Different random starting noise leads to different paths, which is why each generation produces a different image.
Different noise → different paths → different images:
Noise A ──▶ 🌫️ ──▶ 🌁 ──▶ 🖼️ Landscape at dawn
Noise B ──▶ 🌫️ ──▶ 🌁 ──▶ 🖼️ Portrait with warm light
Noise C ──▶ 🌫️ ──▶ 🌁 ──▶ 🖼️ Still life with fruit
Same model, same training, but different starting points → different journeys through "image space"
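To make "different noise, different image" concrete, here is a minimal sketch on a toy problem. Everything in it is an assumption for illustration: "image space" is shrunk to a number line with two clusters of valid data (`MODES`), and instead of a trained neural network we use the closed-form ideal denoiser for that toy distribution. The sampling loop is a simple Euler-style sampler with a hypothetical noise schedule, not the sampler of any particular model.

```python
import numpy as np

# Toy stand-in for "image space": valid data lives in two tight clusters.
MODES = np.array([-2.0, 2.0])
DATA_STD = 0.1  # spread of each cluster


def ideal_denoiser(x, sigma):
    """Best guess of the clean value behind noisy x at noise level sigma.

    Closed form for this toy mixture; a real model learns this from data.
    """
    var = DATA_STD**2 + sigma**2
    # How likely is each cluster, given the noisy value?
    logits = -(x - MODES) ** 2 / (2 * var)
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    # Expected clean value under each cluster, averaged by likelihood.
    per_mode = MODES + (DATA_STD**2 / var) * (x - MODES)
    return float(np.sum(weights * per_mode))


def generate(seed, steps=50, sigma_max=10.0, sigma_min=0.01):
    """Start from pure noise and repeatedly remove a little of it."""
    rng = np.random.default_rng(seed)
    # Noise levels from high to low, ending at exactly zero.
    sigmas = np.append(np.geomspace(sigma_max, sigma_min, steps), 0.0)
    x = rng.normal(0.0, sigma_max)  # the random starting point
    for s_cur, s_next in zip(sigmas[:-1], sigmas[1:]):
        direction = (x - ideal_denoiser(x, s_cur)) / s_cur
        x = x + (s_next - s_cur) * direction  # one small denoising step
    return x


if __name__ == "__main__":
    for seed in range(6):
        print(seed, round(generate(seed), 2))
```

Each seed ends up close to one of the two clusters, and which one depends only on where the starting noise happened to fall. The same seed always gives the same result, which is the toy version of "same model, different starting points."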
Text Conditioning: Steering the Denoising
Without any guidance, the denoising process would produce random valid images. To control what image emerges, the model uses text conditioning. Your text prompt is converted into a numerical representation (via a text encoder), and this representation steers every denoising step.
Without text conditioning:
🎲 Noise ──▶ [Denoise] ──▶ [Denoise] ──▶ ... ──▶ 🖼️ Random valid image
With text conditioning ("a cat on a windowsill"):
🎲 Noise ──▶ [Denoise + 🔤] ──▶ [Denoise + 🔤] ──▶ ... ──▶ 🖼️ Cat on windowsill
                   ↑                  ↑
              Text guides        Text guides
               each step          each step
At every denoising step, the model asks: "Given this noisy image AND the text 'a cat on a windowsill,' what should a slightly less noisy version look like?" The text doesn't dictate every pixel — it biases the denoising trajectory toward the region of image space that matches your description.
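Here is a minimal sketch of that steering on a toy problem. All names are hypothetical: "image space" is a number line, each concept (`"cat"`, `"dog"`) is a cluster on it, and the denoiser is a closed-form stand-in rather than a trained network. A real model receives encoded text vectors, not a dictionary lookup, so treat this only as an illustration of the prompt being consulted at every step.

```python
import numpy as np

# Hypothetical toy "image space": each concept is a cluster on a number line.
CONCEPTS = {"cat": np.array([2.0]), "dog": np.array([-2.0])}
ALL_MODES = np.concatenate(list(CONCEPTS.values()))
DATA_STD = 0.1  # spread of each cluster


def denoise(x, sigma, prompt=None):
    """Estimate the clean value behind noisy x, optionally given a prompt."""
    # Without a prompt every cluster is a candidate; with one, only its cluster.
    modes = ALL_MODES if prompt is None else CONCEPTS[prompt]
    var = DATA_STD**2 + sigma**2
    logits = -(x - modes) ** 2 / (2 * var)
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    per_mode = modes + (DATA_STD**2 / var) * (x - modes)
    return float(np.sum(weights * per_mode))


def generate(seed, prompt=None, steps=50, sigma_max=10.0, sigma_min=0.01):
    rng = np.random.default_rng(seed)
    sigmas = np.append(np.geomspace(sigma_max, sigma_min, steps), 0.0)
    x = rng.normal(0.0, sigma_max)  # same starting noise for a given seed
    for s_cur, s_next in zip(sigmas[:-1], sigmas[1:]):
        x_clean = denoise(x, s_cur, prompt)  # the prompt is used at every step
        x = x + (s_next - s_cur) * (x - x_clean) / s_cur
    return x


if __name__ == "__main__":
    for seed in range(4):
        print(
            seed,
            "no prompt:", round(generate(seed), 2),
            "| cat:", round(generate(seed, "cat"), 2),
            "| dog:", round(generate(seed, "dog"), 2),
        )
```

With no prompt, the result lands in whichever cluster the starting noise favors. With a prompt, the same starting noise is pulled toward the matching cluster, one small step at a time.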
How Text Gets Encoded
Your text prompt goes through a text encoder (models like CLIP or T5) that converts words into vectors — numerical representations that capture meaning. These vectors are fed into the diffusion model at every denoising step via a mechanism called cross-attention, which lets the denoising model "look at" the text while deciding what to generate.
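As a rough illustration of cross-attention, here is a single-head sketch in NumPy. The sizes and the random projection matrices (`W_q`, `W_k`, `W_v`) are assumptions for the example; in a real model they are learned, there are many heads and layers, and the text vectors come from the text encoder rather than a random generator.

```python
import numpy as np


def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(image_tokens, text_tokens, W_q, W_k, W_v):
    """Single-head cross-attention: image tokens query the text tokens."""
    Q = image_tokens @ W_q  # (n_img, d_head): what each image token is looking for
    K = text_tokens @ W_k   # (n_txt, d_head): what each text token offers
    V = text_tokens @ W_v   # (n_txt, d_head): the content that gets passed along
    scores = Q @ K.T / np.sqrt(Q.shape[-1])  # (n_img, n_txt) similarity scores
    weights = softmax(scores, axis=-1)       # each image token's attention over words
    return weights @ V, weights


# Hypothetical sizes: 16 image tokens, 5 text tokens.
rng = np.random.default_rng(0)
n_img, n_txt, d_img, d_txt, d_head = 16, 5, 32, 24, 8
image_tokens = rng.normal(size=(n_img, d_img))  # stand-in for noisy image features
text_tokens = rng.normal(size=(n_txt, d_txt))   # stand-in for text encoder output
W_q = rng.normal(size=(d_img, d_head))          # random here, learned in practice
W_k = rng.normal(size=(d_txt, d_head))
W_v = rng.normal(size=(d_txt, d_head))

out, weights = cross_attention(image_tokens, text_tokens, W_q, W_k, W_v)
print(out.shape, weights.shape)  # (16, 8) (16, 5)
```

Each row of `weights` sums to 1 and says how much one image token draws from each text token. That is the sense in which the denoiser "looks at" the prompt.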
Some recent models in the FLUX family use a large language model (such as Mistral Small) as the text encoder, which helps them follow natural-language instructions better than earlier models that relied on simpler CLIP encoders.
- Denoising works as generation because the model learned the distribution of valid images, not specific images.
- Different starting noise → different denoising paths → different valid images.
- Text conditioning steers every step of denoising toward concepts matching your prompt.
- Text is encoded into vectors and injected via cross-attention — the model "reads" your prompt at every step.