Guidance: Prompt Adherence vs. Natural Outputs

Pippa's one-line summary: Guidance scale (CFG) is a slider for "prompt fidelity vs. naturalness." 5-8 is the sweet spot; 15+ is burnt territory.

Most image generation tools expose a critical slider called guidance scale (or "CFG scale", short for Classifier-Free Guidance). It controls how strongly the model follows your text prompt versus how much creative freedom it has. Understanding this single parameter will save you hours of frustration.

Think of it like giving directions to a painter: low guidance = "Paint me something with a sunset feel" (loose, interpretive, creative). High guidance = "Paint exactly this: a sunset with three clouds, orange and pink, over a lake, with two boats" (rigid, literal, constrained). Both can produce great art, but they fail differently.

How Classifier-Free Guidance Works

Under the hood, CFG runs the model twice at every denoising step:

Unconditional prediction: "What does a less noisy version look like?" (no text input)
Conditional prediction: "What does a less noisy version look like, given this text?"

The final output is calculated as:

Output = Unconditional + Guidance_Scale × (Conditional − Unconditional)

When scale = 1:  Output = Conditional  (just follow the text, no boost)
When scale = 7:  Output = Unconditional + 7 × (difference)  (strongly amplify text influence)
When scale = 20: Output = Unconditional + 20 × (difference) (VERY aggressively follow text)

The guidance scale amplifies the difference between what the model would produce with and without your text. Higher values push the result more aggressively toward the text.
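The formula above can be sketched in a few lines of Python. This is a minimal illustration, not any particular library's implementation: `apply_cfg` is a hypothetical helper, and the predictions are plain lists of floats standing in for the tensors a real model would produce.

```python
def apply_cfg(unconditional, conditional, guidance_scale):
    """Combine two predictions using the classifier-free guidance formula.

    Both inputs are equal-length lists of floats (stand-ins for model tensors).
    """
    if len(unconditional) != len(conditional):
        raise ValueError("predictions must have the same length")
    return [
        u + guidance_scale * (c - u)
        for u, c in zip(unconditional, conditional)
    ]


# Toy values, chosen only for illustration.
uncond = [0.0, 1.0, -1.0]
cond = [0.5, 1.0, -2.0]

print(apply_cfg(uncond, cond, 1.0))  # scale 1 reproduces the conditional prediction
print(apply_cfg(uncond, cond, 7.0))  # scale 7 moves 7x as far from the unconditional one
```

Where the two predictions agree (the middle value here), the scale has no effect; it only stretches the components where the text changes the prediction.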

The Guidance Scale Spectrum

Scale:  1        3        5        7        10       15       20       30
        ┃        ┃        ┃        ┃        ┃        ┃        ┃        ┃
        🎨       🎨       🖼️       🖼️       📐       📐       💀       💀
        Very     Creative Good     Sweet    Tight    Overdone Burnt    Broken
        loose             range    spot

       ← More creative/diverse        More literal/rigid →
       ← More natural                 More saturated/contrasty →
       ← May ignore parts of prompt   May look artificial →

1-3: Very loose. Model follows the general vibe but takes many creative liberties. Good for artistic exploration when you want surprises.
5-8: The sweet spot for most use cases. Good prompt adherence while maintaining natural-looking outputs. Most models default somewhere in this range.
10-15: Strong adherence, but images start looking "pushed": oversaturated colors, exaggerated contrast, loss of subtlety.
15+: Danger zone. Images become brittle, with harsh edges, burned colors, and frequent artifacts. The model is being forced too hard in one direction.
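
One way to see why high values look "pushed" is to rearrange the CFG formula: Output − Conditional = (scale − 1) × (Conditional − Unconditional), so the overshoot past the conditional prediction grows linearly with the scale. The sketch below prints that overshoot for a single toy value. `overshoot` is a hypothetical helper, the numbers are made up for illustration, and tying overshoot to burnt colors is an informal reading rather than a precise model of the artifacts.

```python
def overshoot(unconditional, conditional, guidance_scale):
    """Return how far the guided output lands beyond the conditional prediction."""
    guided = unconditional + guidance_scale * (conditional - unconditional)
    return guided - conditional


# Toy single-value predictions (illustrative only).
uncond, cond = 0.25, 0.5

for scale in (1, 3, 7, 15, 30):
    print(f"scale={scale:>2}  overshoot={overshoot(uncond, cond, scale):+.2f}")
```

With these toy numbers the overshoot is 0 at scale 1 and 7.25 at scale 30, which is (scale − 1) × 0.25 in each case.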

Key Takeaways

Guidance scale controls how strongly the model follows your prompt vs. how naturally it generates.
The sweet spot is usually 5-8. Below that is loose/creative; above 10-12 risks artifacts.
"Crank up guidance" is not the fix for bad prompt adherence; it usually makes things worse.
High guidance amplifies conflicts in your prompt, making complex prompts more fragile.
