Pippa's one-line summary: A reference image gives the model a visual anchor on any dimension: face, pose, style, composition, wardrobe, or mood. Each tool handles its weight slider differently.
Mental model: A film director doesn't just tell the cinematographer "make it look moody." She shows a stack of reference stills: a frame from Blade Runner for the lighting, a Vermeer painting for the color palette, a fashion editorial for the pose. These references collapse an enormous space of possible interpretations down to a specific neighborhood of visual intent. That's exactly what reference images do for AI generators.
What Reference Images Anchor
A reference image can guide the model on many different dimensions, depending on the tool and technique:
- Identity/Face: "Make this person look like the person in this photo" — face anchoring for character consistency.
- Pose: "Use this body position" — structural guidance without copying appearance.
- Style/Aesthetic: "Make it look like this painting" — transferring color palette, brushwork, mood.
- Composition: "Arrange elements like this photo" — preserving layout while changing content.
- Wardrobe/Props: "This character should wear this outfit" — product or costume reference.
- Mood/Lighting: "Light it like this scene" — atmospheric guidance.
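One way to keep these dimensions straight is to record, for each reference still, what it is meant to anchor. The sketch below is only an illustration of that bookkeeping: the `Dimension` enum, the `Reference` dataclass, the file names, and the 0 to 1 weight scale are all hypothetical and do not correspond to any tool's real API or weight range.

```python
from dataclasses import dataclass
from enum import Enum


class Dimension(Enum):
    """The six anchoring dimensions listed above."""
    IDENTITY = "identity"
    POSE = "pose"
    STYLE = "style"
    COMPOSITION = "composition"
    WARDROBE = "wardrobe"
    MOOD = "mood"


@dataclass(frozen=True)
class Reference:
    """One reference still: which file, what it anchors, how strongly."""
    path: str             # hypothetical file name
    anchors: Dimension
    weight: float = 0.5   # illustrative 0..1 scale, not any tool's real range

    def __post_init__(self):
        if not 0.0 <= self.weight <= 1.0:
            raise ValueError(f"weight must be in [0, 1], got {self.weight}")


def uncovered(board):
    """Return the dimensions that no reference on the board anchors."""
    covered = {ref.anchors for ref in board}
    return [d for d in Dimension if d not in covered]


# The director's stack of stills from the mental model, as data.
board = [
    Reference("blade_runner_frame.jpg", Dimension.MOOD, weight=0.7),
    Reference("vermeer_painting.jpg", Dimension.STYLE, weight=0.6),
    Reference("fashion_editorial.jpg", Dimension.POSE, weight=0.4),
]
print([d.value for d in uncovered(board)])
```

Listing the uncovered dimensions makes it easy to see what is still left entirely to the text prompt.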
How Reference Systems Work (High Level)
Platforms implement references differently, but the core idea is the same: a vision encoder (usually CLIP or a specialized adapter) turns the reference image into feature embeddings, and those features are injected into the generation process alongside the text prompt.
┌───────────────┐     ┌───────────────┐
│  Text Prompt  │     │Reference Image│
│ "a warrior in │     │  [photo.jpg]  │
│   a forest"   │     │               │
└───────┬───────┘     └───────┬───────┘
        │                     │
        ▼                     ▼
┌───────────────┐     ┌───────────────┐
│ Text Encoder  │     │ Image Encoder │
│  (CLIP / T5)  │     │  (CLIP / IP)  │
└───────┬───────┘     └───────┬───────┘
        │                     │
        └──────────┬──────────┘
                   ▼
           ┌───────────────┐
           │   Diffusion   │
           │    Process    │
           │ (cross-attn)  │
           └───────┬───────┘
                   ▼
           ┌───────────────┐
           │ Output Image  │
           │ (text-guided  │
           │ + ref-guided) │
           └───────────────┘
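The injection step can be sketched in a few lines. The toy below follows the IP-Adapter-style "decoupled" scheme, where the query attends to text tokens and reference-image tokens separately and the image result is added with a scale factor. It is a simplification under stated assumptions: keys and values reuse the raw tokens (real models apply learned projections), the vectors are made up, and `ref_scale` is a hypothetical name for the influence strength. Closed platforms do not publish their internals, so this should not be read as how any specific product works.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Single-query scaled dot-product attention over a list of tokens."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    weights = softmax(scores)
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(len(values[0]))]


def guided_attention(query, text_tokens, image_tokens, ref_scale):
    """Text attention plus a separately computed, scaled attention over reference tokens."""
    # Simplification: tokens serve as both keys and values (no learned projections).
    text_out = attend(query, text_tokens, text_tokens)
    image_out = attend(query, image_tokens, image_tokens)
    return [t + ref_scale * i for t, i in zip(text_out, image_out)]


# Toy 2-dimensional embeddings, invented for illustration.
query = [1.0, 0.0]
text_tokens = [[1.0, 0.0], [0.0, 1.0]]
image_tokens = [[0.0, 2.0]]

for ref_scale in (0.0, 0.5, 1.0):
    print(ref_scale, guided_attention(query, text_tokens, image_tokens, ref_scale))
```

With `ref_scale` at 0 the output is purely text-guided; raising it shifts the result toward the reference features, which is the behavior a weight slider exposes.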
Platform-Specific Reference Systems
Major platforms each have their own flavor:
- Midjourney V7 (--oref): Omni-Reference system with an --ow (omni-weight) parameter. As a rough guide, weights of 0–50 carry over mostly the face, 200–400 balance identity and style, and 500–1000 produce a near-exact copy. Reported consistency reaches up to 95%.
- DALL-E (Gen_ID): Maintains character identity within a conversation thread, with a reported 75–80% consistency. Identity is tied to the session and does not transfer between conversations.
- Stable Diffusion + IP-Adapter: Lightweight adapter modules that inject reference image features via cross-attention. Highly customizable with adjustable influence strength.
- Leonardo AI: Character reference sheets, with a reported 92% consistency across 50+ pose variations.
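As a concrete illustration of the Midjourney entry, the sketch below assembles a prompt string with the `--oref` and `--ow` parameters and labels a weight using the rough bands quoted in the list. The helper names, the example URL, and the band labels are assumptions made for this example; the 0–1000 range and the band boundaries are taken from the list above rather than checked against Midjourney's documentation, and the values the list leaves unspecified (51–199 and 401–499) are reported as such.

```python
def omni_weight_band(ow):
    """Map an --ow value to the rough bands quoted in the list above."""
    if not 0 <= ow <= 1000:
        raise ValueError(f"ow must be in 0..1000, got {ow}")
    if ow <= 50:
        return "face only"
    if 200 <= ow <= 400:
        return "balanced identity + style"
    if ow >= 500:
        return "near-exact copy"
    # The quoted bands leave 51-199 and 401-499 unspecified.
    return "between quoted bands"


def build_prompt(text, ref_url, ow):
    """Append an omni-reference image and weight to a text prompt."""
    omni_weight_band(ow)  # validates the range before building the string
    return f"{text} --oref {ref_url} --ow {ow}"


# Hypothetical reference URL, for illustration only.
prompt = build_prompt("a warrior in a forest", "https://example.com/hero.png", 300)
print(prompt)
print(omni_weight_band(300))
```

Keeping the weight as an explicit argument makes it easy to generate the same prompt at several strengths and compare the results side by side.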
Key Takeaways
- Reference images anchor identity, pose, style, composition, wardrobe, and mood.
- They work by encoding visual features and injecting them alongside text into the diffusion process.
- Different platforms implement references differently; each has strengths and limitations.
- Good references are clear, well-lit, and high-resolution. Reference weight is a balance, not binary.