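Pippa's one-line summary: The core of Track 1 is this one line: **the model *predicts*; it doesn't *understand***. If you forget this whenever you write a prompt, you end up misdiagnosing the problem as "the model doesn't understand my intent."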
Here's the most important mental shift in this entire course: these models predict plausible outputs — they don't understand the world.
Think of a weather forecaster who has memorized every weather pattern for the last century. When she says "tomorrow will be sunny," she's not controlling the weather or understanding atmospheric physics at a molecular level. She's recognizing that today's conditions closely match historical patterns that were followed by sunny days. She's making a sophisticated prediction based on learned correlations.
Image models work in much the same way. When you type "a golden retriever playing fetch on a beach at sunset," the model doesn't reason: "Okay, a golden retriever is a dog breed with this bone structure, fur is affected by wind and moisture, the sun at this angle creates these shadows..." Instead, it effectively asks: "Given everything I've learned about images paired with similar text, what would a plausible image look like?"
Correlation vs. Causation in Action
This distinction has real consequences:
What you think happens:
"Draw 3 apples" → Model counts: 1, 2, 3 → Draws exactly 3 apples → ✅ Always works

What actually happens:
"Draw 3 apples" → Model predicts: "images with 'three apples' text usually have this many round objects" → Sometimes 2, sometimes 4 🤷
The model has learned the correlation between the text "three apples" and images containing roughly three apple-like objects. But it hasn't learned the concept of counting. This is why you sometimes get two apples or four. It's not stupid — it's doing exactly what it was designed to do: predicting a plausible visual pattern. Counting is just not what that pattern prediction reliably captures.
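The contrast between counting and predicting can be sketched as a toy simulation. This is not how an image model is implemented; it only assumes, for illustration, that the model has absorbed a distribution over how many apple-like objects tend to accompany the text "three apples". The probabilities in `LEARNED_COUNTS` and the names `counting_program` and `pattern_predictor` are invented for this sketch.

```python
import random
from collections import Counter

# Hypothetical numbers: an assumed distribution over how many apple-like
# objects appear in training images captioned "three apples".
LEARNED_COUNTS = {2: 0.15, 3: 0.70, 4: 0.15}


def counting_program(n: int) -> int:
    """What people imagine happens: an explicit count, always exact."""
    return n


def pattern_predictor(rng: random.Random) -> int:
    """Toy stand-in for a generative model: sample a plausible count."""
    counts = list(LEARNED_COUNTS)
    weights = list(LEARNED_COUNTS.values())
    return rng.choices(counts, weights=weights, k=1)[0]


def simulate(trials: int = 1000, seed: int = 0) -> Counter:
    """Tally the object counts produced over many simulated generations."""
    rng = random.Random(seed)
    return Counter(pattern_predictor(rng) for _ in range(trials))


if __name__ == "__main__":
    print("counting program:", counting_program(3))
    print("pattern predictor:", dict(simulate()))
```

Running it tallies mostly threes with a minority of twos and fours, while the counting program never misses. The predictor is not malfunctioning when it returns 2 or 4; those outcomes are simply part of the distribution it samples from.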
Why This Matters for You
Once you stop expecting "understanding" and start expecting "prediction," everything becomes clearer:
- Prompt failures make sense: The model isn't ignoring you — your words didn't reliably activate the patterns you wanted.
- Inconsistency is expected: Each generation samples from learned statistical patterns rather than computing one fixed answer, so the same prompt can give different results each time.
- Strengths make sense: The model is great at things where visual patterns are consistent (faces, landscapes, common compositions) and weak where patterns are sparse or irregular (precise text, counting, novel combinations).
- Control strategies change: You stop trying to "explain" things to the model and start learning which words and patterns reliably trigger which visual outputs.
Key Takeaways
- Generative models predict plausible outputs from learned correlations — they don't understand concepts.
- Failures like wrong finger counts or misspelled text reveal the limits of pattern prediction.
- Stop expecting "understanding" and start thinking in terms of "which patterns does my input activate?"
- This reframe transforms how you prompt, diagnose failures, and build workflows.
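One way to act on this reframe, sketched under heavy assumptions: instead of judging a prompt by a single output, run it across many seeds and measure how often it produces what you wanted. `toy_generate` below is a hypothetical stand-in for a real image model plus an object counter, and the odds in `ASSUMED_ODDS` are invented purely so the example runs; only `hit_rate` reflects the workflow idea.

```python
import random
from typing import Callable

# Invented odds for two hypothetical prompt phrasings; not measured from any model.
ASSUMED_ODDS = {"prompt A": 0.6, "prompt B": 0.9}


def toy_generate(prompt: str, seed: int) -> int:
    """Hypothetical stand-in for 'generate an image, then count the apples'."""
    rng = random.Random(f"{prompt}|{seed}")  # same prompt + seed -> same result
    if rng.random() < ASSUMED_ODDS[prompt]:
        return 3
    return rng.choice([2, 4])


def hit_rate(generate: Callable[[str, int], int], prompt: str,
             target: int, seeds: range) -> float:
    """Fraction of seeds for which the generated count equals the target."""
    hits = sum(generate(prompt, seed) == target for seed in seeds)
    return hits / len(seeds)


if __name__ == "__main__":
    for prompt in ASSUMED_ODDS:
        rate = hit_rate(toy_generate, prompt, target=3, seeds=range(200))
        print(f"{prompt}: {rate:.0%} of seeds gave exactly 3")
```

With a real pipeline, `toy_generate` would be replaced by an actual generation call and some way of checking the result. The comparison then tells you which phrasing activates the desired pattern more reliably, which is a more useful question than whether the model "understood" you.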