Pippa's one-line summary: Spatial prepositions (left/right, behind/in front of) are a weak link in cross-attention. Trying to control exact layout through text alone almost always fails.
Mental model: Imagine describing your living room layout to a friend over the phone: "The couch is to the left of the TV, the lamp is behind the armchair, and there's a rug between the coffee table and the bookshelf." Your friend sketches it out and gets the general vibe right — it looks like a living room — but the couch is on the right, the lamp is in front, and the rug is in the wrong place. They understood the objects perfectly but scrambled the relationships.
That's exactly what happens with image generators. Objects are easy. Spatial prepositions (left, right, behind, between, above) and exact counts such as "exactly three windows" are unreliable.
Why Left/Right Gets Confused
When you say "a cat on the left and a dog on the right," the model processes "cat," "dog," and a vague spatial signal. But "left" and "right" are relative concepts that depend on viewpoint, and the text encoder often represents them weakly. The cross-attention mechanism that links text tokens to image regions has limited spatial precision — it can roughly associate "cat" with one area and "dog" with another, but which side is which is a coin flip in many architectures.
What you prompted:        What you might get:
┌─────────┬─────────┐     ┌─────────┬─────────┐
│   🐱    │   🐶    │     │   🐶    │   🐱    │
│ (left)  │ (right) │     │ (left)  │ (right) │
└─────────┴─────────┘     └─────────┴─────────┘
   Intended layout              Swapped!
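One way to see how often a model swaps sides is to run an object detector over a batch of outputs and compare the horizontal centers of the boxes. The sketch below is a minimal illustration, not a standard benchmark: the detection format (a dict mapping a label to an `(x0, y0, x1, y1)` box in image pixels, with x increasing to the right) and the sample boxes are hypothetical.

```python
def center_x(box):
    """Horizontal center of an (x0, y0, x1, y1) box."""
    x0, _, x1, _ = box
    return (x0 + x1) / 2


def is_left_of(box_a, box_b):
    """True if box_a's center is left of box_b's center in image space."""
    return center_x(box_a) < center_x(box_b)


def left_right_accuracy(detections, left_label, right_label):
    """Fraction of images where left_label really is left of right_label.

    Images missing either object are skipped, since they are an
    object-presence failure rather than a spatial one.
    Returns None when no image contains both objects.
    """
    scored = [d for d in detections if left_label in d and right_label in d]
    if not scored:
        return None
    hits = sum(is_left_of(d[left_label], d[right_label]) for d in scored)
    return hits / len(scored)


# Hypothetical detections for "a cat on the left and a dog on the right".
sample = [
    {"cat": (10, 40, 100, 200), "dog": (150, 50, 250, 210)},   # as prompted
    {"cat": (160, 40, 250, 200), "dog": (10, 50, 110, 210)},   # swapped
    {"cat": (20, 30, 90, 180)},                                # no dog, skipped
]
accuracy = left_right_accuracy(sample, "cat", "dog")
print(accuracy)
```

Comparing centers is a deliberately crude test. It ignores overlapping boxes and the viewer-versus-subject ambiguity of "left", so treat the number as a rough signal only.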
Why "Behind" and "In Front Of" Are Hard
Depth relationships are even trickier. "A person standing behind a fence" requires the model to understand occlusion (the fence partially hides the person) and depth ordering. Sometimes the model places them side by side, or puts the fence behind the person, because the training data has both arrangements and the prompt signal isn't strong enough to disambiguate.
Exact Architectural Details
Try prompting "a building with exactly four windows on the second floor." You'll likely get three windows, or five, or four unevenly spaced ones. This combines the counting problem from the previous lesson with the spatial precision problem: the model must place a specific number of objects in a specific arrangement, and both the count and the arrangement are soft distributions, not hard constraints.
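To see why the two problems compound, a back-of-the-envelope calculation helps. The sketch below assumes, purely for illustration, that getting the count right and getting the arrangement right are independent events. The 0.5 success rates are made-up placeholders, not measurements from any model.

```python
def joint_success(p_count, p_layout):
    """Chance that both the count and the layout come out right,
    assuming (for illustration only) that the two are independent."""
    return p_count * p_layout


def expected_attempts(p_success):
    """Average number of generations needed for one success."""
    return 1 / p_success


# Hypothetical rates: each requirement is met half the time on its own.
p_both = joint_success(0.5, 0.5)
print(p_both, expected_attempts(p_both))
```

Under these assumptions, every extra soft constraint multiplies the success rate down, which is why prompts that stack counts and positions need many more rerolls.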
Why This Is an Architectural Limitation
The root cause is how text-to-image models bridge language and vision. Cross-attention mechanisms map text tokens to regions of the latent image, but this mapping is soft — it's a probability distribution, not a coordinate system. The model doesn't have a spatial canvas with labeled positions. It has a fog of attention weights that gently nudge content toward certain areas. Gentle nudges produce gentle (often wrong) spatial arrangements.
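A toy version of the mechanism makes the "soft" part concrete. The sketch below computes scaled dot-product cross-attention weights with NumPy, using random vectors in place of learned embeddings. The grid size, embedding width, and token list are hypothetical, and real models add learned projections and multiple heads.

```python
import numpy as np


def cross_attention_weights(patch_queries, token_keys):
    """For each image patch, a softmax distribution over text tokens."""
    dim = patch_queries.shape[-1]
    scores = patch_queries @ token_keys.T / np.sqrt(dim)  # (patches, tokens)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)


rng = np.random.default_rng(0)
GRID = 8    # hypothetical 8x8 latent grid
DIM = 16    # hypothetical embedding width
TOKENS = ["cat", "left", "dog", "right"]

patch_queries = rng.normal(size=(GRID * GRID, DIM))  # stand-in for image features
token_keys = rng.normal(size=(len(TOKENS), DIM))     # stand-in for text features

weights = cross_attention_weights(patch_queries, token_keys)

# The attention map for the token "left" is a heatmap over all patches.
left_map = weights[:, TOKENS.index("left")].reshape(GRID, GRID)
left_half_mass = left_map[:, : GRID // 2].sum()
right_half_mass = left_map[:, GRID // 2 :].sum()
print(left_half_mass, right_half_mass)
```

Every patch gives the token "left" some nonzero weight, including patches on the right half of the grid. Nothing in the computation ties the word to a coordinate, so any spatial meaning has to be learned from training data.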
Fragile prompt: "A red ball to the left of a blue cube behind a green cylinder"
More robust prompt: "A still-life arrangement: a red ball, a blue cube, and a green cylinder on a wooden table, studio lighting" (let the model compose naturally, fix positioning in post)
- Spatial prepositions (left/right, behind/in front) are weakly encoded and frequently ignored.
- Cross-attention maps text to image regions probabilistically, not with exact coordinates.
- Combining counting + spatial arrangement compounds both failure modes.
- For precise layout, use visual control (sketches, depth maps) rather than verbal description.
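As a rough illustration of that last point, a layout can be written down as pixels instead of words. The sketch below builds a simple label canvas with NumPy. The function name, the fractional box format, and the label ids are all hypothetical, and how such a canvas is fed to a generator depends entirely on the tool (for example, a sketch or segmentation conditioning input).

```python
import numpy as np


def make_layout_canvas(height, width, boxes):
    """Return an (H, W) uint8 canvas with each box filled by its label id.

    boxes: list of (label_id, x0, y0, x1, y1) with coordinates given as
    fractions of the canvas in [0, 1]. Label 0 is background.
    """
    canvas = np.zeros((height, width), dtype=np.uint8)
    for label_id, x0, y0, x1, y1 in boxes:
        r0, r1 = int(y0 * height), int(y1 * height)
        c0, c1 = int(x0 * width), int(x1 * width)
        canvas[r0:r1, c0:c1] = label_id
    return canvas


CAT, DOG = 1, 2  # hypothetical label ids
layout = make_layout_canvas(
    64, 64,
    [
        (CAT, 0.05, 0.30, 0.45, 0.90),  # cat region on the left
        (DOG, 0.55, 0.30, 0.95, 0.90),  # dog region on the right
    ],
)
print(layout.shape)
```

Unlike the word "left", this canvas is unambiguous: the cat region occupies fixed columns on the left side of the image.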