Reference Images: Visual Anchors for Everything

Pippa's one-line summary: A reference image gives the model a visual anchor on any dimension: face, pose, style, composition, wardrobe, or mood. Each tool handles its weight slider differently.

Mental model: A film director doesn't just tell the cinematographer "make it look moody." She shows a stack of reference stills: a frame from Blade Runner for the lighting, a Vermeer painting for the color palette, a fashion editorial for the pose. These references collapse an enormous space of possible interpretations down to a specific neighborhood of visual intent. That's exactly what reference images do for AI generators.

What Reference Images Anchor

A reference image can guide the model on many different dimensions, depending on the tool and technique:

Identity/Face: "Make this person look like the person in this photo" — face anchoring for character consistency.
Pose: "Use this body position" — structural guidance without copying appearance.
Style/Aesthetic: "Make it look like this painting" — transferring color palette, brushwork, mood.
Composition: "Arrange elements like this photo" — preserving layout while changing content.
Wardrobe/Props: "This character should wear this outfit" — product or costume reference.
Mood/Lighting: "Light it like this scene" — atmospheric guidance.

How Reference Systems Work (High Level)

Platforms implement references differently, but the core idea is the same: a vision encoder (usually CLIP, sometimes paired with a specialized adapter) turns the reference image into feature embeddings, and those features are injected into the generation process alongside the text prompt.

  ┌───────────────┐     ┌───────────────┐
  │  Text Prompt  │     │Reference Image│
  │ "a warrior in │     │  [photo.jpg]  │
  │  a forest"    │     │               │
  └───────┬───────┘     └───────┬───────┘
          │                     │
          ▼                     ▼
  ┌───────────────┐     ┌───────────────┐
  │ Text Encoder  │     │ Image Encoder │
  │ (CLIP / T5)   │     │ (CLIP / IP)   │
  └───────┬───────┘     └───────┬───────┘
          │                     │
          └──────────┬──────────┘
                     ▼
             ┌───────────────┐
             │   Diffusion   │
             │   Process     │
             │ (cross-attn)  │
             └───────┬───────┘
                     ▼
             ┌───────────────┐
             │ Output Image  │
             │ (text-guided  │
             │  + ref-guided)│
             └───────────────┘

Platform-Specific Reference Systems

Major platforms each have their own flavor:

Midjourney V7 (--oref): Omni-Reference system, tuned with the --ow (Omni-Weight) parameter. As a rough guide, weights of 0–50 mostly anchor the face, 200–400 balance identity and style, and 500–1000 push toward a near-exact copy. Reported consistency is up to 95%.
DALL-E (Gen_ID): Maintains character identity within a single conversation thread, with reported consistency of 75–80%. The ID is tied to the session and does not transfer between conversations.
Stable Diffusion + IP-Adapter: Lightweight adapter modules that inject reference-image features via cross-attention. Highly customizable, with an adjustable influence strength.
Leonardo AI: Character reference sheets, with a reported 92% consistency across 50+ pose variations.
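
As a small illustration of how a weight setting ends up in a prompt, the sketch below assembles a Midjourney-style prompt string with --oref and --ow and labels the weight using the bands listed above. The helper names, the example URL, and the accepted 0–1000 range are assumptions taken from this lesson's description rather than from Midjourney's documentation, so treat the output as a template to check against the tool itself.

```python
def describe_weight(omni_weight: int) -> str:
    """Map an --ow value to the rough bands given in this lesson."""
    if 0 <= omni_weight <= 50:
        return "face only"
    if 200 <= omni_weight <= 400:
        return "balanced identity + style"
    if 500 <= omni_weight <= 1000:
        return "near-exact copy"
    return "between the lesson's bands"

def build_oref_prompt(prompt: str, reference_url: str, omni_weight: int) -> str:
    """Append an Omni-Reference image and weight to a text prompt.

    The 0-1000 range check mirrors the bands in this lesson (an assumption).
    """
    if not 0 <= omni_weight <= 1000:
        raise ValueError("omni_weight must be between 0 and 1000")
    return f"{prompt} --oref {reference_url} --ow {omni_weight}"

# Hypothetical reference URL, used only for illustration.
example = build_oref_prompt(
    "a warrior in a forest", "https://example.com/hero.png", 300
)
print(example)
print(describe_weight(300))
```

Keeping the weight as an explicit argument makes it easy to generate the same prompt at several weights and compare how strongly the reference takes over.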

Key Takeaways

Reference images anchor identity, pose, style, composition, wardrobe, and mood.
They work by encoding visual features and injecting them alongside text into the diffusion process.
Different platforms implement references differently; each has strengths and limitations.
Good references are clear, well-lit, and high-resolution. Reference weight is a balance, not binary.
