Pippa's one-line summary: the single phrase 'oil painting' summons brush texture, canvas, color palette, and composition all at once because of statistical associations in the training data.
Type "Renaissance oil painting, chiaroscuro lighting, three-quarter portrait" and the model produces something that genuinely looks like a Renaissance painting — with appropriate brush texture, dramatic shadow, muted earth tones, and classical composition. How? Where did this knowledge come from?
Here's the analogy: imagine a child who grew up in the world's greatest art museum, spending every day surrounded by millions of paintings, photographs, and videos. She was never taught art theory — no one explained the rule of thirds, complementary colors, or Rembrandt lighting. But through sheer exposure, she absorbed these patterns. She can't articulate the rules, but she can feel when something looks "right."
What the Training Data Teaches
During training, the model processes billions of image-text pairs. These pairs come from the internet — a mix of professional photography, fine art, stock images, screenshots, product shots, film stills, and everything in between. From this corpus, the model implicitly learns:
Style as a Pattern Language:
- "Oil painting" = visible brush strokes + canvas texture + specific color palettes + particular ways light falls
- "35mm film photography" = specific grain patterns + lens characteristics + color science + common compositions
- "Anime" = specific proportions + line work + color blocking + eye styles + shading conventions
- "Cyberpunk" = neon colors + rain + dark environments + chrome surfaces + Asian typography
Lighting as Physics Approximation:
- How shadows fall based on light direction
- How different light sources (sun, neon, candle, flash) produce different color temperatures and shadow qualities
- How materials interact with light (metal reflects, skin scatters, glass refracts)
- How atmospheric effects (fog, dust, rain) scatter light
Composition as Visual Grammar:
- Where subjects are typically placed in professional photographs
- How foreground, midground, and background create depth
- Leading lines, framing, negative space, visual weight
- Camera angle conventions: bird's eye, worm's eye, eye level, Dutch angle
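To make the "style as a pattern language" idea concrete, here is a minimal sketch of how a style phrase can become associated with visual traits purely through counting. Everything in it is an assumption for illustration: the `TOY_PAIRS` corpus, the attribute tags, and the `attribute_frequencies` helper are hypothetical, and a real model learns from pixels with a neural network rather than from hand-labeled tags.

```python
from collections import Counter

# Hypothetical toy corpus: each caption is paired with visual attributes that
# an imagined annotator tagged in the image. Real training data carries no
# such tags; the model has to infer comparable features from the pixels.
TOY_PAIRS = [
    ("oil painting of a harbor", {"brush strokes", "canvas texture", "muted palette"}),
    ("oil painting portrait", {"brush strokes", "canvas texture", "dramatic shadow"}),
    ("35mm film photography street scene", {"film grain", "shallow depth of field"}),
    ("35mm film photography portrait", {"film grain", "warm color cast"}),
    ("oil painting still life", {"brush strokes", "muted palette"}),
]


def attribute_frequencies(pairs, phrase):
    """Estimate how often each attribute appears when the caption contains `phrase`."""
    counts = Counter()
    matches = 0
    for caption, attributes in pairs:
        if phrase in caption:
            matches += 1
            counts.update(attributes)
    if matches == 0:
        return {}
    return {attr: n / matches for attr, n in counts.items()}


oil = attribute_frequencies(TOY_PAIRS, "oil painting")
# Print attributes from most to least strongly associated with the phrase.
for attr, share in sorted(oil.items(), key=lambda kv: (-kv[1], kv[0])):
    print(f"{attr}: {share:.2f}")
```

In this toy corpus, "brush strokes" shows up in every "oil painting" caption while "film grain" never does, so the phrase alone is enough to predict the trait. Nobody wrote a rule saying oil paintings have brush strokes; the link falls out of co-occurrence.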
Why This Creates Bias
The training data isn't a neutral sample of all possible images. It's a sample of what's on the internet, which is heavily biased toward:
- Professional, well-composed photographs (highly shared/liked content)
- Western aesthetic conventions (dominant in large-scale datasets)
- Popular, polished styles (commercial photography, trending digital art)
- Conventional beauty standards and common demographics
This is why default generations (without strong style direction) tend to look polished, conventional, and sometimes generic. The model has learned that "most images the internet calls good look like this," so it gravitates toward those patterns.
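The pull toward the default can be illustrated with a toy simulation. This is only a sketch under loud assumptions: the style shares in `STYLE_SHARES` are made-up numbers, and the `strength` parameter is a crude stand-in for how firmly a prompt pins down a style. It is not how text conditioning works inside a real image model.

```python
import random
from collections import Counter

# Hypothetical style shares in a toy "internet" corpus (made-up numbers).
STYLE_SHARES = {
    "polished commercial photo": 0.55,
    "trending digital art": 0.25,
    "fine art painting": 0.12,
    "rough pencil sketch": 0.05,
    "folk textile pattern": 0.03,
}


def sample_style(rng, requested=None, strength=0.0):
    """Pick a style: honor the request with probability `strength`,
    otherwise fall back to the corpus prior."""
    if requested is not None and rng.random() < strength:
        return requested
    styles = list(STYLE_SHARES)
    weights = list(STYLE_SHARES.values())
    return rng.choices(styles, weights=weights, k=1)[0]


def tally(requested=None, strength=0.0, n=10_000, seed=0):
    """Count which styles come out over `n` simulated generations."""
    rng = random.Random(seed)
    return Counter(sample_style(rng, requested, strength) for _ in range(n))


default = tally()  # no style direction at all
steered = tally("rough pencil sketch", strength=0.9)  # explicit, strong direction

print("default favorite:", default.most_common(1)[0][0])
print("steered favorite:", steered.most_common(1)[0][0])
```

With no direction, the most common output is simply the most common style in the corpus. A rare style only wins when the request is strong enough to outweigh that prior, which is why vague prompts drift back toward the polished default.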
Key Takeaways
- Style, lighting, and composition knowledge comes from statistical patterns learned across billions of image-text pairs.
- The model doesn't "understand" art theory — it has absorbed correlations between text descriptions and visual patterns.
- Training data bias means defaults lean toward polished, conventional, Western aesthetics.
- You can steer away from defaults, but you're working against strong learned priors — be explicit about what you want.