One-line summary from 피파: the model doesn't *count* "three apples"; it reproduces the statistical pattern of images captioned "3 apples." 1–2 is safe, 3–5 is risky, and 6+ is just "many."
Mental model: Ask a friend to quickly sketch "five birds on a wire" from memory. Don't let them count — just draw fast. You'll probably get four, six, or a vague cluster. They know what five means, but translating that number into a precise spatial arrangement under time pressure is surprisingly hard. Image models face the same problem, except they never get to count.
Why Numbers Are Fuzzy
When you prompt "three red apples on a wooden table," the model processes "three" as a concept associated with small-group-ness. It's learned that "three apples" images in its training data usually show a small cluster — but it never learned to count to three as a discrete operation. The model is doing pattern completion, not arithmetic.
This means:
- Small numbers (1–2) are fairly reliable. "One cat" almost always gives one cat. "Two people shaking hands" usually works.
- Three to five is the danger zone. You might get the right count, but you might also be off by one in either direction.
- Six and above becomes a rough "many" to the model. "Seven candles on a birthday cake" might give you five or nine.
Prompt: "three apples"
What you expect: 🍎 🍎 🍎 (exactly 3)
What you might get: 🍎 🍎 🍎 🍎 (oops, 4)
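The three bands above can be written down as a small lookup, which is handy if you want to flag risky prompts before spending generations on them. This is only a sketch: `count_risk` is a hypothetical helper name, and the cut-offs are the rules of thumb from this section, not measured thresholds for any particular model.

```python
def count_risk(n: int) -> str:
    """Map a requested object count to the rough reliability band from this section."""
    if n < 1:
        raise ValueError("count must be at least 1")
    if n <= 2:
        return "reliable"  # 1-2: usually rendered correctly
    if n <= 5:
        return "risky"     # 3-5: may be off by one
    return "many"          # 6+: treated as a loose "many"


for n in (1, 3, 7):
    print(n, count_risk(n))
```

A check like this cannot tell you what a given model will do; it only encodes the heuristic so that prompts asking for six or more of something get a second look.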
Why This Happens Architecturally
The text encoder converts "three" into a vector — a cloud of numbers representing the concept. That vector nudges the denoising process toward arrangements that correlate with "three-ness" in the training data. But the correlation is soft, not hard. The model doesn't have a counting subroutine that says "place exactly N objects." It generates a scene that feels like the right amount.
Think of it this way: if you saw thousands of stock photos tagged "three apples," some would have two with one partially hidden, others might have four with one cropped. The model learned from that noisy distribution, so its idea of "three" is a probability cloud, not the integer 3.
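A toy simulation can make the "probability cloud" idea concrete. The sketch below pretends to sample images captioned with a given number and records how many objects are actually visible in each. Everything in it is an assumption made for illustration: the function name `noisy_caption_counts` is hypothetical, and the error weights are invented, not taken from any real dataset.

```python
import random
from collections import Counter


def noisy_caption_counts(label: int, n_images: int, seed: int = 0) -> Counter:
    """Simulate the visible object count in images captioned with `label`.

    ASSUMPTION: the offsets and weights are made up for illustration only.
    """
    rng = random.Random(seed)
    offsets = [-1, 0, 1]          # one object hidden, exact, one extra object
    weights = [0.15, 0.70, 0.15]  # invented proportions, not measured data
    visible = (label + rng.choices(offsets, weights)[0] for _ in range(n_images))
    return Counter(visible)


counts = noisy_caption_counts(label=3, n_images=1000)
total = sum(counts.values())
for k in sorted(counts):
    print(f"{k} visible: {counts[k] / total:.0%}")
```

A model trained on captions like these sees "three" attached to a spread of counts, so the most it can learn is that spread, not the integer itself.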
Strategies That Help
- Describe spatial arrangement: Instead of "five cups," try "five cups arranged in a single row from left to right." Spatial cues give the model structural scaffolding.
- Use small numbers: Stick to 1–2 when the exact count matters. Three often works but already falls in the danger zone, and anything higher is a gamble.
- Generate and select: Run multiple generations and pick the one with the correct count.
- Composite: Generate one perfect apple, then duplicate it in an image editor.
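Two of these strategies, describing the layout and generating until the count is right, combine naturally into a loop. The following is a minimal sketch under stated assumptions: `spatial_prompt` and `generate_until_count` are hypothetical helpers, and `generate` and `count_objects` are placeholders for whatever image model and counting step you use (an object detector, or a person looking at the result). No real image API is called here; the stand-in generator just returns lists so the sketch runs on its own.

```python
from typing import Callable, Optional, TypeVar

Image = TypeVar("Image")

NUMBER_WORDS = {1: "one", 2: "two", 3: "three", 4: "four", 5: "five"}


def spatial_prompt(count: int, noun: str, layout: str) -> str:
    """Pair a small count with an explicit layout description."""
    if count not in NUMBER_WORDS:
        raise ValueError("counts above five tend to be treated as 'many'")
    return f"{NUMBER_WORDS[count]} {noun} {layout}"


def generate_until_count(
    generate: Callable[[str, int], Image],
    count_objects: Callable[[Image], int],
    prompt: str,
    target: int,
    max_attempts: int = 8,
) -> Optional[Image]:
    """Return the first image whose counted objects equal `target`, else None."""
    for seed in range(max_attempts):
        image = generate(prompt, seed)
        if count_objects(image) == target:
            return image
    return None


# Stand-in for a real model (ASSUMPTION): an "image" is a list of object
# names, and the fake generator draws one cup too many on even seeds.
def fake_generate(prompt: str, seed: int) -> list:
    return ["cup"] * (6 if seed % 2 == 0 else 5)


prompt = spatial_prompt(5, "cups", "arranged in a single row from left to right")
picked = generate_until_count(fake_generate, len, prompt, target=5)
print(prompt)
print(picked)
```

The loop gives up after `max_attempts` and returns `None`, which is the signal to fall back on compositing. The two prompts that follow show the same idea in plain text: the first relies on a bare count, the second on a small count plus layout.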
"Seven balloons floating in the sky"
"Three colorful balloons — red, blue, yellow — floating against a clear sky, evenly spaced"
Key Takeaways
- Models don't count; they predict arrangements that statistically correlate with number words.
- 1–2 objects: reliable. 3–5: risky. 6+: essentially "many."
- Describe spatial layout rather than just stating a number.
- For exact counts, composite in post-production.