Why Character Consistency Is Difficult

Pippa's one-line summary: same prompt → different noise → different image. The model draws *categories*, not *individuals*. True character consistency is only possible with a reference image.

Mental model: Imagine you hire ten different street portrait artists to each draw the same person from your verbal description alone: "A woman in her 30s, red curly hair, green eyes, freckles, wearing a denim jacket." You'd get ten recognizably similar but noticeably different women. Each artist interprets "red curly hair" differently, places the freckles differently, draws the jawline differently. None of them are wrong — they're all valid interpretations of your description.

That's exactly what happens when you generate multiple images from the same prompt. Each generation is a fresh sample from a probability distribution. The model has no "memory" of what it drew last time. It doesn't maintain a character sheet internally. Every generation starts from new random noise and follows a slightly different denoising path.

Why There's No Built-In Persistence

Unlike a human artist who can look at their previous sketch and maintain consistency, a diffusion model treats each generation as an independent event. The only shared input is the text prompt, and as we've seen, text is a lossy compression of visual intent. "Red curly hair" maps to an enormous space of possible red-curly-hair configurations.

Same prompt → Different noise seed → Different result

  Prompt: "A woman with red curly hair, green eyes, denim jacket"

  Seed 42:  👩‍🦰 (round face, tight curls, dark denim)
  Seed 43:  👩‍🦰 (angular face, loose waves, light denim)
  Seed 44:  👩‍🦰 (oval face, medium curls, vintage wash)

  All valid. None identical.
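
The seed listing above can be mimicked with a toy sketch. This is not a diffusion model: `toy_generate` and the `UNSPECIFIED_DETAILS` table are hypothetical stand-ins, assuming only that the seed decides every detail the prompt leaves open.

```python
import random

# Hypothetical details the prompt does not pin down. Every option is a
# valid reading of "red curly hair" or "denim jacket".
UNSPECIFIED_DETAILS = {
    "face shape": ["round", "angular", "oval"],
    "curl pattern": ["tight curls", "loose waves", "medium curls"],
    "denim wash": ["dark denim", "light denim", "vintage wash"],
}


def toy_generate(prompt: str, seed: int) -> dict:
    """Toy stand-in for one generation: the seed picks one option per open detail."""
    # Seeding with prompt + seed makes each call independent of earlier calls,
    # mirroring the fact that the model keeps no memory between generations.
    rng = random.Random(f"{prompt}|{seed}")
    return {
        detail: rng.choice(options)
        for detail, options in UNSPECIFIED_DETAILS.items()
    }


prompt = "A woman with red curly hair, green eyes, denim jacket"
for seed in (42, 43, 44):
    print(seed, toy_generate(prompt, seed))
```

Every output satisfies the prompt, yet nothing ties one call to the next. The specific combination printed for each seed is arbitrary, which is the point: the prompt fixes the category and the noise picks the individual.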

Why This Matters for Creative Work

Character consistency is essential for:

Comics and storyboards: The same character must look identical across panels.
Brand mascots: A company character must be recognizable everywhere.
Video generation: Frame-to-frame identity must hold (more in Track 6).
Marketing campaigns: A generated spokesperson must look consistent across assets.

Without consistency, you don't have a character — you have a category of similar-looking people.

Emerging Solutions

The field has developed several approaches (explored deeply in Track 5):

Reference images: Feeding the model a reference image of the character anchors identity. Midjourney V7's Omni-Reference (--oref) reportedly reaches up to 95% consistency. DALL-E uses Gen_ID to refer back to an earlier image within a conversation.
Character sheets: Generate a multi-pose reference grid first, then use it as input for subsequent generations. Leonardo AI reports 92% consistency with this approach.
IP-Adapter and similar: Specialized adapters that inject visual identity into the diffusion process via cross-attention, preserving face and appearance across generations.
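Seed locking: Reusing the same seed with an unchanged prompt and settings reproduces nearly the same image, but any prompt edit (a new pose, outfit, or scene) shifts the denoising path and the character's identity drifts. It helps but doesn't guarantee consistency.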
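
To make the IP-Adapter item above more concrete, here is a minimal numeric sketch of decoupled cross-attention, the mechanism IP-Adapter is described as using: the same query attends to the text features and to the reference-image features separately, and the image result is added with a scale factor. The function names and the tiny 2-D feature vectors are made up for illustration, and the learned projection layers of a real adapter are omitted.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Single-query scaled dot-product attention."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]


def decoupled_cross_attention(query, text_kv, image_kv, image_scale):
    """Text attention plus a separately computed, scaled image attention."""
    text_out = attend(query, *text_kv)
    image_out = attend(query, *image_kv)
    return [t + image_scale * i for t, i in zip(text_out, image_out)]


# Made-up features: two text tokens and one reference-image token.
query = [1.0, 0.5]
text_kv = ([[1.0, 0.0], [0.0, 1.0]], [[0.2, 0.8], [0.6, 0.4]])
image_kv = ([[0.5, 0.5]], [[0.9, 0.1]])

text_only = decoupled_cross_attention(query, text_kv, image_kv, image_scale=0.0)
with_reference = decoupled_cross_attention(query, text_kv, image_kv, image_scale=0.6)
print(text_only, with_reference)
```

With `image_scale=0.0` the output is driven by text alone. Raising the scale pulls every generation toward the same reference features regardless of the seed, which is why a visual anchor holds identity where text cannot.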

Key Takeaways

Each generation is an independent sample — the model has zero memory of previous outputs.
Text prompts describe categories of appearances, not specific identities.
True consistency requires visual anchoring: reference images, character sheets, or specialized adapters.
Seed locking helps but is fragile — it's a starting point, not a solution.
