C.W.K.
Stream
Lesson 01 of 10 · published

What a Prompt Actually Does Inside the Model

~12 min · prompting, control, l1

Pippa's one-line summary: A prompt is not a command. It's **gravity that creates an attractive region in latent space**. Once you hold this mental model, you stop asking the wrong question: "Why didn't the model follow my prompt?"

Let's kill the biggest misconception right now: a prompt is not a command. You're not telling the model what to draw the way you'd instruct an employee. You're nudging a trajectory through latent space, more like adjusting the wind direction for a sailboat. The boat (the model) will go somewhere interesting regardless; your text just biases which direction it drifts.

Here's what actually happens when you type a prompt:

Step 1: Text Encoding

Your text is tokenized and fed into a text encoder, a neural network that converts tokens into numerical vectors. These vectors don't encode dictionary definitions. With a CLIP-style encoder, they encode associations learned from image-text pairs; with a text-only encoder such as T5, the diffusion model learns during training which visual patterns go with those vectors. Either way, "sunset" doesn't mean "the sun going below the horizon." In effect it means "the visual pattern cluster that images captioned with 'sunset' tend to share."

Your prompt: "a golden retriever playing fetch on a beach at sunset"
                                    ↓
Text Encoder (CLIP / T5 / Mistral):
                                    ↓
Numerical vectors: one per token, e.g. [0.23, -0.87, 0.44, ...], each with hundreds to thousands of dimensions
                                    ↓
These vectors are injected into the diffusion model via cross-attention
at EVERY denoising step
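
To make the shape of that pipeline concrete, here is a toy sketch of Step 1. It is not CLIP or T5: the whitespace tokenizer, the tiny vocabulary, the random embedding table, and the names `tokenize`, `encode`, and `EMBED_DIM` are all made up for illustration. The only point is that a prompt goes in and one vector per token comes out.

```python
import numpy as np

# Toy stand-in for a text encoder. Real encoders use learned subword
# tokenizers and transformer layers; everything here is an assumption.
EMBED_DIM = 8  # real encoders use hundreds to thousands of dimensions


def tokenize(prompt: str) -> list[str]:
    """Lowercase and split on whitespace (hypothetical toy tokenizer)."""
    return prompt.lower().split()


def encode(prompt: str, seed: int = 0) -> np.ndarray:
    """Return one vector per token, shape (num_tokens, EMBED_DIM)."""
    tokens = tokenize(prompt)
    # Build a vocabulary from the prompt itself, one id per distinct token.
    vocab = {tok: i for i, tok in enumerate(sorted(set(tokens)))}
    rng = np.random.default_rng(seed)
    # Random table standing in for learned embeddings.
    table = rng.normal(size=(len(vocab), EMBED_DIM))
    ids = [vocab[tok] for tok in tokens]
    return table[ids]


text_vectors = encode("a golden retriever playing fetch on a beach at sunset")
print(text_vectors.shape)  # (10, 8): 10 tokens, 8 dimensions each
```

In this toy, the two occurrences of "a" get identical vectors because it is a plain table lookup. A real encoder runs the tokens through transformer layers, so each vector also reflects the surrounding words.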

Step 2: Conditioning the Denoising

At each denoising step, the model uses cross-attention to look at your text vectors and incorporate them into its prediction. Think of it like a GPS recalculating the route at every intersection — the text keeps steering the denoising trajectory toward the region of latent space that matches your description.
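
Here is a minimal NumPy sketch of that idea, under loud assumptions: the weight matrices are random rather than learned, the "latents" are a handful of toy positions, and the update rule is purely illustrative (a real sampler predicts and removes noise on a schedule). What it does show faithfully is the cross-attention pattern: queries come from the image latents, keys and values come from the text vectors, and the same text vectors are consulted at every step.

```python
import numpy as np


def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(latents, text_vectors, w_q, w_k, w_v):
    """Each latent position queries the text tokens and mixes their values."""
    q = latents @ w_q        # (num_positions, d)
    k = text_vectors @ w_k   # (num_tokens, d)
    v = text_vectors @ w_v   # (num_tokens, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = softmax(scores, axis=-1)  # each row sums to 1 over tokens
    return weights @ v, weights


rng = np.random.default_rng(0)
num_positions, latent_dim, num_tokens, text_dim = 16, 4, 10, 8
d = latent_dim  # attention width matches latent_dim so the toy update adds up

latents = rng.normal(size=(num_positions, latent_dim))  # start from noise
text_vectors = rng.normal(size=(num_tokens, text_dim))  # stand-in encoder output
w_q = rng.normal(size=(latent_dim, d))  # random, not learned
w_k = rng.normal(size=(text_dim, d))
w_v = rng.normal(size=(text_dim, d))

# Toy loop: the SAME text vectors steer the latents at every step.
for step in range(5):
    text_signal, weights = cross_attention(latents, text_vectors, w_q, w_k, w_v)
    latents = latents + 0.1 * text_signal  # illustrative nudge, not a real sampler
```

Each row of `weights` says how much one latent position is listening to each prompt token at that step, which is why a prompt behaves like a continuous pull rather than a one-time instruction.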

Why "Command" Thinking Fails

If you think of prompts as commands, you'll be constantly frustrated. Consider:

❌ Command Thinking

"Draw a red car in the left third of the frame with a blue building behind it and exactly three birds in the sky on the right side"

✅ Trajectory Thinking

"A vintage red sports car parked on a city street, blue apartment building in the background, a few birds overhead, wide-angle street photography"

The command-style prompt tries to micromanage spatial layout — something the model handles through learned correlations, not literal positioning. The trajectory-style prompt describes a scene type and lets the model's learned patterns handle the composition.

What Your Prompt Can and Can't Do

| Reliable | Possible but Inconsistent | Usually Fails |
| --- | --- | --- |
| Subject identity ("a cat") | Precise spatial layout | Exact text in the image |
| Style/mood ("oil painting") | Specific counts ("three birds") | Pixel-perfect composition |
| Lighting ("warm sunset light") | Relative positions (left/right) | Complex logical conditions |
| Camera/lens ("macro shot") | Specific expressions/emotions | Sequential instructions |
| Color palette ("muted earth tones") | Specific hand poses | Negation via text alone |

Key Takeaways
  • A prompt is not a command — it's a gravitational pull on the denoising trajectory.
  • Text gets encoded into numerical vectors and injected at every denoising step via cross-attention.
  • Prompts reliably control subject, style, mood, and lighting. They're unreliable for precise layout, counting, and text.
  • Think "describe a plausible image" rather than "issue drawing instructions."

Exercise

Pick a generic prompt ('a beautiful landscape'). Rewrite it three ways: as a command, as a caption for an existing image, and as a director's scene description. Generate all three with the same seed. Which approach gives the best output?
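
If you want to script the comparison, a small harness like the one below keeps the seed fixed so the prompt wording is the only thing that changes. This is a sketch under assumptions: `run_same_seed` is a hypothetical helper, `generate(prompt, seed)` stands for whatever function wraps your own image tool, `fake_generate` is a stub so the example runs without a model, and the three rewrites are illustrative examples rather than model answers.

```python
from typing import Callable, Dict


def run_same_seed(prompts: Dict[str, str],
                  generate: Callable[[str, int], object],
                  seed: int = 1234) -> Dict[str, object]:
    """Call generate(prompt, seed) once per variant, reusing one seed."""
    return {name: generate(text, seed) for name, text in prompts.items()}


# Three illustrative rewrites of 'a beautiful landscape'.
variants = {
    "command": "Draw a beautiful landscape with mountains on the left and a lake on the right",
    "caption": "Alpine lake at dawn, mist over the water, snow-capped peaks in the distance",
    "director": "Wide establishing shot of a quiet mountain valley at golden hour, low fog, soft warm light",
}


def fake_generate(prompt: str, seed: int) -> str:
    """Stub standing in for a real image generator (assumption)."""
    return f"seed={seed} | {prompt}"


results = run_same_seed(variants, fake_generate)
for name, output in results.items():
    print(name, "->", output)
```

Swap `fake_generate` for a function that calls your actual generator and returns an image, then compare the three outputs side by side.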
