Pose and Composition Control: ControlNet Concepts

Pippa's one-line summary: ControlNet adds *structural conditioning* (edge, depth, pose, sketch) to the prompt. It gives you, visually, the precise spatial control that text cannot.

Mental model: Imagine you're a film director working with a body double. You position the double exactly how you want — arms up, head tilted, stepping forward — then photograph them as a silhouette. Now you hand that silhouette to an artist and say "paint this person in this exact pose, but make them a medieval knight in a forest." The silhouette controls the structure; the text controls the content. That's ControlNet.

What ControlNet Does

ControlNet adds a structural conditioning signal to the diffusion process. Instead of relying only on text (which is imprecise about spatial details), you provide an explicit visual guide that tells the model where things should be and how they should be shaped. The text then tells the model what those things should look like.

Types of Control Signals

Different types of control images extract different structural information:

  Control Type     What It Captures              Best For
  ──────────────────────────────────────────────────────────────
  Canny Edge       Edges and outlines            Sharp structural guidance
  Depth Map        Distance from camera          3D spatial arrangement
  Pose (OpenPose)  Skeleton/joint positions      Human body positioning
  Normal Map       Surface orientation           Lighting-consistent surfaces
  Segmentation     Semantic regions              Scene layout (sky/ground/building)
  Scribble/Sketch  Rough hand-drawn guides       Quick compositional ideas
  Lineart          Clean line drawings           Illustration and manga
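To make the idea of a control image concrete, here is a minimal sketch of how an edge-style control signal could be derived from a picture. It is a deliberately simplified gradient threshold, not the real Canny algorithm (which also smooths the image, thins edges, and applies hysteresis). The function name `edge_map` and the tiny 4x4 image are hypothetical, chosen only for illustration.

```python
def edge_map(image, threshold=0.5):
    """Return a binary edge map: 1 where brightness changes sharply, else 0."""
    height, width = len(image), len(image[0])
    edges = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            # Forward differences; at the border a pixel is compared with itself.
            right = image[y][min(x + 1, width - 1)]
            below = image[min(y + 1, height - 1)][x]
            gradient = abs(right - image[y][x]) + abs(below - image[y][x])
            edges[y][x] = 1 if gradient >= threshold else 0
    return edges


# A 4x4 grayscale image: dark left half, bright right half.
image = [[0.0, 0.0, 1.0, 1.0] for _ in range(4)]
edges = edge_map(image)
for row in edges:
    print(row)
```

The resulting map marks only the column where dark meets bright. That is the kind of structure-only information a control image carries: where the boundaries are, with nothing about color, texture, or subject.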

How It Works (High Level)

ControlNet operates as a parallel neural network that "shadows" the main diffusion model. The control image is processed by this parallel network, which produces feature maps that are injected into the main model at each denoising step. This means the model is simultaneously guided by:

Text prompt (semantic content: what the scene is about)
Control signal (structural content: where things are positioned)
Random noise (variation: the specific creative interpretation)
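The three inputs above can be sketched as a toy denoising loop. This is an illustration under heavy assumptions, not the actual ControlNet architecture: `main_block` and `control_branch` are hypothetical stand-ins for the main diffusion model and the parallel network, and plain lists of numbers stand in for images and feature maps.

```python
import random


def main_block(latent, text_embedding):
    # Stand-in for the main model: mixes the noisy latent with the text signal.
    return [0.9 * z + 0.1 * t for z, t in zip(latent, text_embedding)]


def control_branch(control_image, weight):
    # Stand-in for the parallel network: turns the control image into features.
    return [weight * c for c in control_image]


def denoise(latent, text_embedding, control_image, steps=3, control_weight=0.5):
    for _ in range(steps):
        features = main_block(latent, text_embedding)
        residual = control_branch(control_image, control_weight)
        # Injection: control features are added to the main features at every step.
        latent = [f + r for f, r in zip(features, residual)]
    return latent


rng = random.Random(0)                          # random noise: the variation
noise = [rng.gauss(0.0, 1.0) for _ in range(4)]
text = [1.0, 1.0, 1.0, 1.0]                     # text prompt: the content
control = [0.0, 1.0, 1.0, 0.0]                  # control signal: the structure

with_control = denoise(noise, text, control)
without_control = denoise(noise, text, control, control_weight=0.0)
print([round(a - b, 3) for a, b in zip(with_control, without_control)])
```

Comparing the two runs shows that the control branch changes the result only at the positions where the control image is non-zero. The other positions are driven by the text and the noise alone.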

Control Strength

Like reference weight, control strength is a slider:

Low (0.2–0.4): The control image is a suggestion. The model may deviate from it where the prompt pulls in a different direction.
Medium (0.5–0.7): Strong guidance. Structure is clearly followed but the model has room for natural interpretation.
High (0.8–1.0): Strict adherence. The output closely follows the control signal, and it can look stiff or unnatural if the control image itself is imperfect.
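One common way to implement such a slider is to multiply the control features by the strength before they are injected. Hugging Face diffusers, for example, exposes a comparable knob named `controlnet_conditioning_scale`. The sketch below assumes that design. The helper names `scale_residual` and `describe` are hypothetical, and the bands are simply the ranges listed above, so values that fall between them get no label.

```python
def scale_residual(residual, strength):
    """Scale control features by the strength slider before injection."""
    if not 0.0 <= strength <= 1.0:
        raise ValueError("this sketch expects strength in [0, 1]")
    return [strength * r for r in residual]


def describe(strength):
    """Map a strength value to the bands listed above."""
    bands = [(0.2, 0.4, "low"), (0.5, 0.7, "medium"), (0.8, 1.0, "high")]
    for lower, upper, name in bands:
        if lower <= strength <= upper:
            return name
    return "between bands"


residual = [0.0, 2.0, 2.0, 0.0]  # made-up control features
for strength in (0.3, 0.6, 1.0):
    print(describe(strength), scale_residual(residual, strength))
```

At strength 0 the control features vanish and the model falls back to text and noise alone. At 1.0 they are injected at full magnitude, which is why an imperfect control image shows through most clearly at the high end.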

Key Takeaways

ControlNet adds structural conditioning (edges, depth, pose) alongside text conditioning.
Different control types guide different aspects: edges for shape, depth for 3D, pose for body.
Control strength balances structural fidelity vs. creative freedom.
Multiple control signals can be stacked for multi-dimensional guidance.
A rough control image is better than 50 prompt iterations for spatial precision.
