What a Prompt Actually Does Inside the Model

Pippa's one-line summary: A prompt is not a command. It is **gravity that creates an attractive region in latent space**. Once you hold this mental model, you stop asking the wrong question: "Why didn't the model obey?"

Let's kill the biggest misconception right now: a prompt is not a command. You're not telling the model what to draw, like instructing an employee. You're nudging a trajectory through latent space — more like adjusting the wind direction for a sailboat. The boat (model) will go somewhere interesting regardless; your text just biases which direction it drifts.

Here's what actually happens when you type a prompt:

Step 1: Text Encoding

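Your text is tokenized and fed into a text encoder, a neural network that converts tokens into numerical vectors. These vectors don't encode dictionary definitions; they encode learned associations. With a CLIP-style encoder, those associations come directly from image-text pairs; with a text-only encoder such as T5, the diffusion model learns during training which visual patterns each vector corresponds to. Either way, "sunset" doesn't mean "the sun going below the horizon" to the model. It means "the visual pattern cluster that images captioned with 'sunset' tend to share."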

Your prompt: "a golden retriever playing fetch on a beach at sunset"
                                    ↓
Text Encoder (CLIP / T5 / Mistral):
                                    ↓
Numerical vectors: [0.23, -0.87, 0.44, ...] × hundreds of dimensions
                                    ↓
These vectors are injected into the diffusion model via cross-attention
at EVERY denoising step
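
To make the pipeline above concrete, here is a minimal sketch in plain Python. It is a toy, not how CLIP or T5 work internally: the whitespace tokenizer, the hash-based `embed_token` function, and the 8-dimensional `EMBED_DIM` are all assumptions made up for illustration. Real encoders use subword tokenizers and learned, context-dependent vectors. The only point it shows is the shape of the result: one vector per token.

```python
import hashlib

EMBED_DIM = 8  # hypothetical; real encoders use hundreds or thousands of dimensions


def tokenize(prompt: str) -> list[str]:
    """Toy tokenizer: lowercase and split on whitespace (real ones use subwords)."""
    return prompt.lower().split()


def embed_token(token: str, dim: int = EMBED_DIM) -> list[float]:
    """Map a token to a fixed pseudo-random vector with values in [-1, 1).

    A trained encoder would produce learned, context-dependent vectors;
    the hash here only guarantees that the same token gets the same vector.
    """
    digest = hashlib.sha256(token.encode("utf-8")).digest()
    return [digest[i] / 128.0 - 1.0 for i in range(dim)]


def encode_prompt(prompt: str) -> list[list[float]]:
    """Return one vector per token: the sequence the diffusion model attends to."""
    return [embed_token(tok) for tok in tokenize(prompt)]


prompt = "a golden retriever playing fetch on a beach at sunset"
vectors = encode_prompt(prompt)
print(len(vectors), "tokens x", len(vectors[0]), "dimensions")
# prints: 10 tokens x 8 dimensions
```

Notice that nothing in `vectors` says "draw a dog." It is just a list of numbers that the diffusion model has learned to associate with certain visual patterns.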

Step 2: Conditioning the Denoising

At each denoising step, the model uses cross-attention to look at your text vectors and incorporate them into its prediction. Think of it like a GPS recalculating the route at every intersection — the text keeps steering the denoising trajectory toward the region of latent space that matches your description.
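
A rough sketch of that repeated lookup, again in plain Python. Several things here are simplifying assumptions rather than the real architecture: the text vectors act as both keys and values, the learned projection matrices are left out, and the update rule with `STEP_SIZE` is a made-up stand-in for the model's learned noise prediction. The 3-dimensional "sunset" and "beach" vectors are hypothetical as well.

```python
import math


def softmax(scores: list[float]) -> list[float]:
    """Numerically stable softmax."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def cross_attention(
    query: list[float], text_vectors: list[list[float]]
) -> tuple[list[float], list[float]]:
    """One image position (query) attends over all text token vectors.

    Simplification: the text vectors serve as both keys and values, and the
    learned projection matrices of a real attention layer are omitted.
    """
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, key) / scale for key in text_vectors])
    dim = len(text_vectors[0])
    context = [
        sum(w * vec[i] for w, vec in zip(weights, text_vectors)) for i in range(dim)
    ]
    return context, weights


# Hypothetical 3-dimensional embeddings for a two-token prompt.
text_vectors = [
    [1.0, 0.0, 0.0],  # stand-in for "sunset"
    [0.0, 1.0, 0.0],  # stand-in for "beach"
]

latent = [0.2, -0.1, 0.3]  # one latent position, starting from arbitrary "noise"
STEP_SIZE = 0.5            # hypothetical update strength

for step in range(4):
    # The same text vectors are consulted again at every denoising step.
    context, weights = cross_attention(latent, text_vectors)
    latent = [x + STEP_SIZE * (c - x) for x, c in zip(latent, context)]
    print(step, [round(w, 3) for w in weights], [round(x, 3) for x in latent])
```

The third component of `latent`, which neither text vector supports, shrinks at every step, while the first two are pulled toward a blend of the text vectors. The text never issues an instruction; it just keeps tugging the state in the same direction each time it is consulted.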

Why "Command" Thinking Fails

If you think of prompts as commands, you'll be constantly frustrated. Consider:

❌ Command Thinking

"Draw a red car in the left third of the frame with a blue building behind it and exactly three birds in the sky on the right side"

✅ Trajectory Thinking

"A vintage red sports car parked on a city street, blue apartment building in the background, a few birds overhead, wide-angle street photography"

The command-style prompt tries to micromanage spatial layout — something the model handles through learned correlations, not literal positioning. The trajectory-style prompt describes a scene type and lets the model's learned patterns handle the composition.

What Your Prompt Can and Can't Do

Reliable	Possible but Inconsistent	Usually Fails
Subject identity ("a cat")	Precise spatial layout	Exact text in the image
Style/mood ("oil painting")	Specific counts ("three birds")	Pixel-perfect composition
Lighting ("warm sunset light")	Relative positions (left/right)	Complex logical conditions
Camera/lens ("macro shot")	Specific expressions/emotions	Sequential instructions
Color palette ("muted earth tones")	Specific hand poses	Negation via text alone

Key Takeaways

A prompt is not a command — it's a gravitational pull on the denoising trajectory.
Text gets encoded into numerical vectors and injected at every denoising step via cross-attention.
Prompts reliably control subject, style, mood, and lighting. They're unreliable for precise layout, counting, and text.
Think "describe a plausible image" rather than "issue drawing instructions."
