Pippa's one-line summary: hands are hard because of high articulation × occlusion × small pixel budget × no anatomical model. Jewelry, teeth, and glass share the same failure profile.
Mental model: Even professional artists say hands are the hardest thing to draw. There's a reason art students spend weeks on hand studies: each hand has 27 bones, a near-continuous range of possible poses, and constant self-occlusion, and it looks dramatically different from every angle. Now imagine learning to draw hands only by looking at millions of photographs where hands appear in every conceivable position, size, lighting condition, and level of visibility. That's what an image model does, and the results are predictably messy.
Why Hands Are Especially Hard
Hands fail for a collision of reasons that don't affect simpler body parts:
- High articulation: Five fingers, each with three joints, plus a complex wrist. The combinatorial space of valid hand poses is enormous.
- Frequent occlusion: Fingers overlap each other, wrap around objects, hide behind palms. The model often sees partial hands in training data.
- Small pixel budget: In most photographs, hands occupy a tiny fraction of the image. The model has fewer pixels to work with and less training signal per hand.
- No anatomical model: The system doesn't know humans have five fingers. It knows that hand-shaped regions tend to have elongated protrusions, but the exact count is a soft statistical pattern, not a hard rule.
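To put rough numbers on the first and third points, here is a back-of-the-envelope sketch. The constants are illustrative assumptions, not measurements: each finger joint is quantized to only four bend angles, the wrist and inter-joint constraints are ignored, and the hand is assumed to fill a 64×64 box in a 1024×1024 image. Many of the counted configurations are anatomically impossible, which is part of the point: the model has no rule telling it which ones to exclude.

```python
# Rough, illustrative numbers only; none of these constants are measured values.

FINGERS = 5
JOINTS_PER_FINGER = 3   # as described in the list above
BINS_PER_JOINT = 4      # assumption: each joint quantized to 4 bend angles


def pose_count(fingers=FINGERS, joints=JOINTS_PER_FINGER, bins=BINS_PER_JOINT):
    """Number of finger configurations under a coarse per-joint discretization."""
    return bins ** (fingers * joints)


def pixel_fraction(hand_w, hand_h, image_w, image_h):
    """Fraction of the image area covered by a hand bounding box."""
    return (hand_w * hand_h) / (image_w * image_h)


if __name__ == "__main__":
    print(pose_count())                         # 4 ** 15 = 1073741824
    print(pixel_fraction(64, 64, 1024, 1024))   # 0.00390625, i.e. under 0.4% of the image
```

Even with this very coarse discretization, the hand has over a billion finger configurations to cover, while occupying well under one percent of the pixels in the assumed framing.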
Jewelry, Teeth, and Small Details
The same class of problems affects any small, detailed, structurally specific element:
- Jewelry: Rings duplicate, necklace chains break mid-air, earrings mismatch between ears. These are tiny, precise structures that the model treats as decorative texture rather than structured objects.
- Teeth: Too many teeth, uneven sizes, teeth that blur into each other — because the model learned "mouth region has white shapes" rather than "32 teeth arranged in an arc."
- Glasses: Frames may connect inconsistently, lenses may differ in shape, or the glasses might partially merge with the face.
Transparent and Reflective Objects
Glass, water, mirrors, and chrome surfaces create another category of failure. These objects don't have a single stable appearance — they look different depending on what's behind or around them. The model must simultaneously generate the object and a plausible refraction or reflection, which requires implicit scene understanding that pixel-pattern matching doesn't provide.
Practical Workarounds
- Frame out the problem: Crop at the wrist, use close-up faces without hands, or put hands in pockets/behind back.
- Specify the pose: "Arms crossed," "holding a coffee mug with both hands" — constrained poses fail less.
- Inpaint after generation: Generate the overall image, then fix hands with a targeted edit pass.
- Use reference images: Provide a hand pose reference to guide the model (covered in Track 5).
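The inpainting workaround usually starts with a mask that marks the region to regenerate. The sketch below shows one way to build such a mask around a hand bounding box using only the standard library. The function names `padded_box` and `build_mask`, the 25% padding, and the convention that 1 means "regenerate" are all assumptions for illustration; real inpainting tools differ in mask format and polarity, so check the tool you use.

```python
def padded_box(box, image_size, pad_ratio=0.25):
    """Expand (left, top, right, bottom) by pad_ratio of its size, clamped to the image."""
    left, top, right, bottom = box
    width, height = image_size
    pad_x = int((right - left) * pad_ratio)
    pad_y = int((bottom - top) * pad_ratio)
    return (
        max(0, left - pad_x),
        max(0, top - pad_y),
        min(width, right + pad_x),
        min(height, bottom + pad_y),
    )


def build_mask(box, image_size):
    """Return a row-major 2D list: 1 inside the box (regenerate), 0 elsewhere (keep)."""
    left, top, right, bottom = box
    width, height = image_size
    return [
        [1 if left <= x < right and top <= y < bottom else 0 for x in range(width)]
        for y in range(height)
    ]


if __name__ == "__main__":
    # Hypothetical hand location in a 100x100 image.
    region = padded_box((40, 50, 60, 70), (100, 100))
    mask = build_mask(region, (100, 100))
    print(region)                          # (35, 45, 65, 75)
    print(sum(sum(row) for row in mask))   # 900 masked pixels
```

Padding the box gives the edit pass some surrounding context (wrist, held object) so the regenerated hand can blend with the rest of the image instead of being confined to the exact hand outline.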
Key Takeaways
- Hands fail because of high articulation, frequent occlusion, small pixel area, and no anatomical knowledge.
- Jewelry, teeth, glasses, and transparent objects fail for similar reasons: structural precision + high variability.
- The model predicts plausible visual patterns, not physically correct structures.
- Workarounds: constrain the pose, crop out problem areas, or fix with inpainting.