C.W.K.
Stream
Lesson 02 of 10 · published

Why Counting Is Hard

~15 min · failures, diagnosis, l2

Pippa's one-line summary: The model doesn't *count* "three apples"; it reproduces the statistical pattern of images captioned "3 apples." 1–2 is safe, 3–5 is risky, and 6+ is just "many."

Mental model: Ask a friend to quickly sketch "five birds on a wire" from memory. Don't let them count — just draw fast. You'll probably get four, six, or a vague cluster. They know what five means, but translating that number into a precise spatial arrangement under time pressure is surprisingly hard. Image models face the same problem, except they never get to count.

Why Numbers Are Fuzzy

When you prompt "three red apples on a wooden table," the model processes "three" as a concept associated with small-group-ness. It's learned that "three apples" images in its training data usually show a small cluster — but it never learned to count to three as a discrete operation. The model is doing pattern completion, not arithmetic.

This means:

  • Small numbers (1–2) are fairly reliable. "One cat" almost always gives one cat. "Two people shaking hands" usually works.
  • Three to five is the danger zone. You might get the right count, or you might be off by one in either direction.
  • Six and above becomes a rough "many" to the model. "Seven candles on a birthday cake" might give you five or nine.

Prompt: "three apples"

  What you expect:     What you might get:

    🍎 🍎 🍎              🍎 🍎
                          🍎 🍎
   (exactly 3)           (oops, 4)

Why This Happens Architecturally

The text encoder converts "three" into a vector — a cloud of numbers representing the concept. That vector nudges the denoising process toward arrangements that correlate with "three-ness" in the training data. But the correlation is soft, not hard. The model doesn't have a counting subroutine that says "place exactly N objects." It generates a scene that feels like the right amount.

Think of it this way: if you saw thousands of stock photos tagged "three apples," some would have two with one partially hidden, others might have four with one cropped. The model learned from that noisy distribution, so its idea of "three" is a probability cloud, not the integer 3.
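
The "probability cloud" idea can be sketched as a toy in Python. This is only an illustration, not how a diffusion model works internally: the list `observed_counts` is invented, standing in for how many apples are actually visible in images captioned "three apples," and the helper names are hypothetical.

```python
import random
from collections import Counter

# HYPOTHETICAL data, invented for illustration only: how many apples are
# actually visible in 100 imaginary images captioned "three apples".
observed_counts = [3] * 70 + [2] * 15 + [4] * 15


def empirical_distribution(counts):
    """Turn raw counts into a {count: probability} mapping."""
    tally = Counter(counts)
    total = sum(tally.values())
    return {count: n / total for count, n in sorted(tally.items())}


def sample_count(distribution, rng):
    """Draw one count by sampling, the way pattern completion would."""
    values = list(distribution)
    weights = list(distribution.values())
    return rng.choices(values, weights=weights, k=1)[0]


dist = empirical_distribution(observed_counts)
rng = random.Random(0)  # fixed seed so reruns give the same draws
draws = [sample_count(dist, rng) for _ in range(1000)]

print(dist)            # the learned "idea of three"
print(Counter(draws))  # mostly 3, but 2 and 4 show up as well
```

Nothing in `sample_count` checks that the result equals 3. The most likely outcome is three, but the neighbors stay possible, which is the sense in which "three" is soft rather than hard.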

Strategies That Help

  • Describe spatial arrangement: Instead of "five cups," try "five cups arranged in a single row from left to right." Spatial cues give the model structural scaffolding.
  • Use small numbers: Stick to 1–2 when exact count matters. Three is borderline, and anything higher is a gamble.
  • Generate and select: Run multiple generations and pick the one with the correct count.
  • Composite: Generate one perfect apple, then duplicate it in an image editor.

❌ Weak

"Seven balloons floating in the sky"

✅ Better

"Three colorful balloons — red, blue, yellow — floating against a clear sky, evenly spaced"
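
The "generate and select" strategy can be sketched as a simple retry loop. This is a minimal sketch under assumptions: `generate` and `count_objects` are hypothetical callables that you would supply yourself (your image tool of choice, plus either a manual count or an object detector). The stand-ins below only fake that behavior so the loop can run.

```python
from itertools import cycle


def generate_until_count(generate, count_objects, prompt, target, max_attempts=8):
    """Regenerate until the counted objects match `target`.

    Returns (image, attempts_used); image is None if no attempt matched.
    """
    for attempt in range(1, max_attempts + 1):
        image = generate(prompt)
        if count_objects(image) == target:
            return image, attempt
    return None, max_attempts


# Stand-ins for illustration only (assumptions, not a real image API):
# each fake "image" is a list of labels, and the counts cycle through 4, 2, 3.
_fake_counts = cycle([4, 2, 3])


def fake_generate(prompt):
    return ["apple"] * next(_fake_counts)


def fake_count(image):
    return len(image)


image, attempts = generate_until_count(
    fake_generate, fake_count, "three red apples in a single row", target=3
)
print(attempts, image)  # 3 ['apple', 'apple', 'apple']
```

The `max_attempts` cap matters because higher counts may never come out exact; when the loop returns `None`, compositing is the fallback.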

Key Takeaways

  • Models don't count: they predict arrangements that statistically correlate with number words.
  • 1–2 objects: reliable. 3–5: risky. 6+: essentially "many."
  • Describe spatial layout rather than just stating a number.
  • For exact counts, composite in post-production.

Exercise

Generate each of these prompts: '3 apples on a table', '7 apples', '12 apples', 'many apples'. Count the actual number of apples in each result. At what point did the count become unreliable?
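
One way to record your results is a small tally script. This is a sketch: the numbers in `my_counts` are PLACEHOLDERS, not measurements, and should be replaced with what you actually counted. The `summarize` helper is a hypothetical name. 'many apples' has no target number, so it is left out of the table.

```python
from statistics import mean


def summarize(requested, observed):
    """Return (exact_match_rate, mean_absolute_error) for one prompt."""
    hits = sum(1 for n in observed if n == requested)
    errors = [abs(n - requested) for n in observed]
    return hits / len(observed), mean(errors)


# PLACEHOLDER counts: replace each list with the counts from your own images.
my_counts = {
    3: [3, 3, 4],
    7: [6, 8, 7],
    12: [9, 14, 10],
}

for requested, observed in my_counts.items():
    rate, error = summarize(requested, observed)
    print(f"{requested:>2} apples: exact {rate:.0%}, mean error {error:.1f}")
```

The requested number where the exact-match rate drops sharply is where counting stopped being reliable for your model.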
