Pippa's one-line summary: Why text rendering is hard: the model predicts pixels that *look* like letters; it does not *spell* them. Short, common words usually come out fine; long or unusual ones break.
Mental model: Imagine a spelling bee contestant who learned every word by staring at blurry photos of books from across a room. She recognizes the shapes of words — the tall letters, the descenders, the general silhouette — but she never learned the alphabet itself. When asked to spell "RESTAURANT," she writes something that looks like it should be right: "RESTUARNT." Close enough to fool a quick glance, but wrong on inspection.
That's essentially what early image generators did with text. They learned that signs have blocky letters, book covers have serif fonts, and storefronts have certain shapes — but they never learned spelling as a rule system. The model predicts pixels that statistically resemble text, not characters that logically form words.
The Token-to-Pixel Gap
Here's the core problem: when you type "a coffee shop sign that says OPEN," the word "OPEN" gets processed by a text encoder as a semantic concept — the idea of openness, availability, welcome. But rendering the specific pixel arrangement O-P-E-N on a sign requires something entirely different: precise spatial placement of individual letter forms. The model has to bridge from meaning-space to pixel-space, and those are very different domains.
Text Prompt                Latent Space                    Pixels
┌──────────┐               ┌──────────────┐                ┌──────────────┐
│ "OPEN"   │ ── encode ──→ │ concept of   │ ── denoise ──→ │ O P E N      │
│ (tokens) │               │ openness,    │                │ (hopefully)  │
│          │               │ sign-like    │                │              │
└──────────┘               └──────────────┘                └──────────────┘
Meaning                    Fuzzy pattern                   Exact glyphs
(abstract)                 (statistical)                   (precise)
The gap between "fuzzy pattern" and "exact glyphs" is where the errors creep in. Natural images tolerate variation: a slightly different cloud shape is still a cloud. But text tolerates zero variation. "COFFE SHPO" is not "COFFEE SHOP." There's no graceful degradation with spelling.
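To make "zero variation" concrete, here is a minimal sketch that scores the misspellings above against their targets. It assumes the rendered strings are what a reader (or an OCR pass, not shown) would get from the image. The helper names `similarity` and `text_is_correct` are made up for this example.

```python
from difflib import SequenceMatcher


def similarity(target: str, rendered: str) -> float:
    """Rough 0..1 similarity between the intended and the rendered text."""
    return SequenceMatcher(None, target, rendered).ratio()


def text_is_correct(target: str, rendered: str) -> bool:
    """Rendered text only passes on an exact match."""
    return target == rendered


# (intended text, what ended up in the image) -- both pairs come from the text above
pairs = [("RESTAURANT", "RESTUARNT"), ("COFFEE SHOP", "COFFE SHPO")]

for target, rendered in pairs:
    score = round(similarity(target, rendered), 2)
    print(f"{target!r} vs {rendered!r}: similarity={score}, "
          f"correct={text_is_correct(target, rendered)}")
```

Both pairs score well above 0.8 on similarity, which is why they fool a quick glance, yet both fail the only check that matters for a sign or a poster.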
Why Newer Models Got Better
Models like DALL-E 3, Ideogram, and Flux made significant progress on text rendering. How? Several strategies:
- Dedicated text rendering pathways: Some models now include specialized sub-networks trained specifically on text-in-image data, essentially teaching the model the alphabet as a separate skill.
- Better training data curation: Including more high-quality images with clearly legible text (signs, posters, book covers) helps the model learn letter forms.
- Larger text encoders: Models that use T5-XXL or similar large language model encoders tend to encode spelling more precisely than CLIP-based encoders.
- Character-level awareness: Some architectures now process text character-by-character rather than as multi-character tokens, giving them spelling precision.
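The difference between token-level and character-level input can be sketched with a toy tokenizer. Everything here is an assumption made for illustration: `TOY_VOCAB` is an invented vocabulary, and the greedy longest-match split is a simplification, not the algorithm of any particular encoder.

```python
from typing import List

# Hypothetical subword vocabulary, invented for this example.
# Real encoders learn vocabularies with tens of thousands of pieces.
TOY_VOCAB = {"REST", "AUR", "ANT", "OP", "EN"}


def subword_tokenize(word: str, vocab=TOY_VOCAB) -> List[str]:
    """Greedy longest-match split; unknown spans fall back to single characters."""
    pieces, i = [], 0
    while i < len(word):
        # Try the longest remaining slice first, shrinking until something fits.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab or j == i + 1:
                pieces.append(word[i:j])
                i = j
                break
    return pieces


def char_tokenize(word: str) -> List[str]:
    """Character-level view: every letter and its position is explicit."""
    return list(word)


word = "RESTAURANT"
print(subword_tokenize(word))  # a few opaque pieces
print(char_tokenize(word))     # one entry per letter
```

With the subword split, the model receives three IDs and has to have memorized which letters hide inside each one. With the character split, the letter sequence is part of the input itself.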
Weak prompt: "A poster for a jazz festival with all the details written on it"
Better prompt: "A vintage jazz poster with bold text reading JAZZ NIGHT, warm orange tones, art deco style, minimal text"
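The contrast between the two prompts above can be turned into a rough pre-flight check. This is only a sketch under stated assumptions: it treats ALL-CAPS runs as the literal text to render, and the thresholds `MAX_WORDS` and `MAX_CHARS` are illustrative guesses, not measured limits of any model. `lint_text_prompt` is a hypothetical helper name.

```python
import re
from typing import List

# Illustrative thresholds (assumptions): "short" is not defined precisely anywhere.
MAX_WORDS = 3
MAX_CHARS = 20

# Assumed convention: the literal text to render is written in ALL CAPS.
CAPS_RUN = re.compile(r"\b[A-Z]{2,}(?:\s+[A-Z]{2,})*\b")


def lint_text_prompt(prompt: str) -> List[str]:
    """Return warnings about how the prompt asks for rendered text."""
    warnings = []
    literals = CAPS_RUN.findall(prompt)
    if not literals:
        warnings.append("no explicit text given; the model has to invent the wording")
    for literal in literals:
        if len(literal.split()) > MAX_WORDS or len(literal) > MAX_CHARS:
            warnings.append(f"'{literal}' is long; consider adding it in post-production")
    return warnings


weak = "A poster for a jazz festival with all the details written on it"
better = ("A vintage jazz poster with bold text reading JAZZ NIGHT, "
          "warm orange tones, art deco style, minimal text")

print(lint_text_prompt(weak))
print(lint_text_prompt(better))
```

The first prompt gets flagged because it never says what the text should be, while the second passes: it names two short words and asks for nothing else.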
Key Takeaways
- Text rendering fails because models predict pixel patterns, not spelled words.
- Short, common words work best; long or unusual text is unreliable.
- Newer models (DALL-E 3, Ideogram, Flux) improved via dedicated text pathways and better encoders.
- Professional workflow: generate the image, add text in post-production.
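One way to picture the post-production step is to keep the generated artwork as a background and lay real, vector text on top of it. The sketch below builds such an overlay as an SVG string using only the standard library. The file name `jazz_poster.png`, the canvas size, and the styling values are all placeholders chosen for the example, and `overlay_text_svg` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET


def overlay_text_svg(image_href: str, width: int, height: int, text: str) -> str:
    """Build an SVG that places exact, vector text over a raster image."""
    svg = ET.Element("svg", {
        "xmlns": "http://www.w3.org/2000/svg",
        "width": str(width),
        "height": str(height),
        "viewBox": f"0 0 {width} {height}",
    })
    # The generated artwork sits underneath as a linked image.
    ET.SubElement(svg, "image", {
        "href": image_href,
        "width": str(width),
        "height": str(height),
    })
    # The lettering is real text, so the spelling is exact by construction.
    label = ET.SubElement(svg, "text", {
        "x": str(width // 2),
        "y": str(height // 5),
        "text-anchor": "middle",
        "font-family": "sans-serif",
        "font-size": str(height // 10),
        "fill": "white",
    })
    label.text = text
    return ET.tostring(svg, encoding="unicode")


svg = overlay_text_svg("jazz_poster.png", 1024, 1536, "JAZZ NIGHT")
print(svg)
```

Saving the string to a `.svg` file next to the image gives a poster whose text can be edited, restyled, or translated without regenerating anything. Older SVG 1.1 renderers may expect `xlink:href` instead of `href` on the image element.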