Pippa's one-line summary: A prompt is not a command. It's **gravity that creates an attractive region in latent space**. Once you hold this mental model, you stop asking the wrong question: "Why didn't the model follow my prompt?"
Let's kill the biggest misconception right now: a prompt is not a command. You're not telling the model what to draw, like instructing an employee. You're nudging a trajectory through latent space — more like adjusting the wind direction for a sailboat. The boat (model) will go somewhere interesting regardless; your text just biases which direction it drifts.
Here's what actually happens when you type a prompt:
Step 1: Text Encoding
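Your text is tokenized and fed into a text encoder, a neural network that converts tokens into numerical vectors. These vectors don't encode dictionary definitions; they encode learned associations. A CLIP-style encoder learns them from image-text pairs, while a language-model encoder such as T5 learns them from text alone, and the diffusion model learns to map them to visual patterns during its own training. So, to the model, "sunset" doesn't mean "the sun going below the horizon"; it means, roughly, "the visual pattern cluster that images captioned with 'sunset' tend to share."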
Your prompt: "a golden retriever playing fetch on a beach at sunset"
↓
Text Encoder (CLIP / T5 / Mistral)
↓
Numerical vectors: [0.23, -0.87, 0.44, ...], one vector per token, each with hundreds to thousands of dimensions
↓
These vectors are injected into the diffusion model via cross-attention
at EVERY denoising step
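To make the shape of that pipeline concrete, here is a minimal sketch of the encoding step. It is a toy stand-in, not how CLIP or T5 work internally: the whitespace tokenizer, the hash-seeded vectors, and the names `toy_tokenize`, `toy_embed`, `toy_encode`, and `EMBED_DIM` are all assumptions made for illustration. Real encoders use learned subword tokenizers and produce contextual vectors, where a token's vector depends on its neighbors.

```python
import hashlib
import random

EMBED_DIM = 8  # toy size; real encoders use hundreds to thousands of dimensions


def toy_tokenize(prompt: str) -> list[str]:
    """Split on whitespace. Real tokenizers use learned subword vocabularies."""
    return prompt.lower().split()


def toy_embed(token: str, dim: int = EMBED_DIM) -> list[float]:
    """Map a token to a fixed pseudo-random vector (a stand-in for learned weights)."""
    digest = hashlib.sha256(token.encode("utf-8")).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    return [rng.uniform(-1.0, 1.0) for _ in range(dim)]


def toy_encode(prompt: str) -> list[list[float]]:
    """Return one vector per token: the sequence the diffusion model conditions on."""
    return [toy_embed(token) for token in toy_tokenize(prompt)]


prompt = "a golden retriever playing fetch on a beach at sunset"
vectors = toy_encode(prompt)
print(len(vectors), "tokens x", len(vectors[0]), "dims")  # 10 tokens x 8 dims
```

The point to take away is the output shape: a sequence of vectors, one per token, rather than a single instruction. That sequence is what the denoiser consults.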
Step 2: Conditioning the Denoising
At each denoising step, the model uses cross-attention to look at your text vectors and incorporate them into its prediction. Think of it like a GPS recalculating the route at every intersection — the text keeps steering the denoising trajectory toward the region of latent space that matches your description.
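The cross-attention lookup described above can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions: the 2-D vectors are hand-picked rather than real embeddings, the learned query/key/value projection matrices of a real model are omitted, and the function names are hypothetical. It shows only the core operation, in which each image-side position scores every text token and takes a weighted mix of the text values.

```python
import math


def softmax(scores):
    """Numerically stable softmax: turns raw scores into weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cross_attention(queries, keys, values):
    """Scaled dot-product attention from image-side queries to text-side keys/values.

    Returns (weights, outputs): for each query, the attention weights over the
    text tokens and the resulting weighted mix of the text value vectors.
    """
    scale = math.sqrt(len(keys[0]))
    all_weights, outputs = [], []
    for q in queries:
        weights = softmax([dot(q, k) / scale for k in keys])
        mixed = [
            sum(w * v[i] for w, v in zip(weights, values))
            for i in range(len(values[0]))
        ]
        all_weights.append(weights)
        outputs.append(mixed)
    return all_weights, outputs


# Hypothetical, hand-picked 2-D vectors for two text tokens (not real embeddings).
tokens = ["sunset", "beach"]
text_keys = [[1.0, 0.0], [0.0, 1.0]]
text_values = [[10.0, 0.0], [0.0, 10.0]]

# Two latent positions: the first resembles "sunset", the second "beach".
latent_queries = [[2.0, 0.0], [0.0, 2.0]]

weights, outputs = cross_attention(latent_queries, text_keys, text_values)
for position, w in enumerate(weights):
    print(position, dict(zip(tokens, (round(x, 3) for x in w))))
```

Note that the weights are soft: every token contributes something to every position, and no token is ever "executed" as an instruction. In a real model this lookup runs inside the network at each denoising step, which is why the text acts as a continuous pull rather than a one-time command.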
Why "Command" Thinking Fails
If you think of prompts as commands, you'll be constantly frustrated. Consider:
Command-style: "Draw a red car in the left third of the frame with a blue building behind it and exactly three birds in the sky on the right side"
Trajectory-style: "A vintage red sports car parked on a city street, blue apartment building in the background, a few birds overhead, wide-angle street photography"
The command-style prompt tries to micromanage spatial layout — something the model handles through learned correlations, not literal positioning. The trajectory-style prompt describes a scene type and lets the model's learned patterns handle the composition.
What Your Prompt Can and Can't Do
| Reliable | Possible but Inconsistent | Usually Fails |
|---|---|---|
| Subject identity ("a cat") | Precise spatial layout | Exact text in the image |
| Style/mood ("oil painting") | Specific counts ("three birds") | Pixel-perfect composition |
| Lighting ("warm sunset light") | Relative positions (left/right) | Complex logical conditions |
| Camera/lens ("macro shot") | Specific expressions/emotions | Sequential instructions |
| Color palette ("muted earth tones") | Specific hand poses | Negation via text alone |
- A prompt is not a command — it's a gravitational pull on the denoising trajectory.
- Text gets encoded into numerical vectors and injected at every denoising step via cross-attention.
- Prompts reliably control subject, style, mood, and lighting. They're unreliable for precise layout, counting, and text.
- Think "describe a plausible image" rather than "issue drawing instructions."