Pippa's one-line summary: ControlNet adds *structural conditioning* (edge, depth, pose, sketch) to the prompt. It gives you, through a visual input, the precise spatial control that text cannot.
Mental model: Imagine you're a film director working with a body double. You position the double exactly how you want — arms up, head tilted, stepping forward — then photograph them as a silhouette. Now you hand that silhouette to an artist and say "paint this person in this exact pose, but make them a medieval knight in a forest." The silhouette controls the structure; the text controls the content. That's ControlNet.
What ControlNet Does
ControlNet adds a structural conditioning signal to the diffusion process. Instead of relying only on text (which is imprecise about spatial details), you provide an explicit visual guide that tells the model where things should be and how they should be shaped. The text then tells the model what those things should look like.
Types of Control Signals
Different types of control images extract different structural information:
Control Type      What It Captures           Best For
──────────────────────────────────────────────────────────────────────────────
Canny Edge        Edges and outlines         Sharp structural guidance
Depth Map         Distance from camera       3D spatial arrangement
Pose (OpenPose)   Skeleton/joint positions   Human body positioning
Normal Map        Surface orientation        Lighting-consistent surfaces
Segmentation      Semantic regions           Scene layout (sky/ground/building)
Scribble/Sketch   Rough hand-drawn guides    Quick compositional ideas
Lineart           Clean line drawings        Illustration and manga
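To make "control image" concrete, here is a minimal sketch of how an edge-style control image can be derived from a picture. It is a simplified stand-in for a Canny preprocessor: it only thresholds the gradient magnitude and skips Canny's smoothing, edge thinning, and hysteresis steps. The function name `edge_map`, the threshold value, and the toy 6x6 "photo" are assumptions made for illustration.

```python
def edge_map(gray, threshold=100):
    """Return a binary control image (255 = edge, 0 = background).

    gray is a list of rows, each a list of 0-255 brightness values.
    Border pixels are left at 0 because they lack a full neighborhood.
    """
    height, width = len(gray), len(gray[0])
    edges = [[0] * width for _ in range(height)]
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            gx = gray[y][x + 1] - gray[y][x - 1]  # horizontal brightness change
            gy = gray[y + 1][x] - gray[y - 1][x]  # vertical brightness change
            if (gx * gx + gy * gy) ** 0.5 >= threshold:
                edges[y][x] = 255
    return edges


# Toy "photo": dark left half, bright right half.
photo = [[0, 0, 0, 255, 255, 255] for _ in range(6)]
control_image = edge_map(photo)
for row in control_image:
    print(row)
```

The result keeps only the boundary between the dark and bright halves. That is the kind of information an edge control image hands to the model: where shapes begin and end, with color and texture discarded.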
How It Works (High Level)
ControlNet operates as a parallel neural network that "shadows" the main diffusion model. The control image is processed by this parallel network, which produces feature maps that are injected into the main model at each denoising step. This means the model is simultaneously guided by:
- Text prompt (semantic content: what the scene is about)
- Control signal (structural content: where things are positioned)
- Random noise (variation: the specific creative interpretation)
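The three inputs above can be sketched as a toy loop. This is not real diffusion: the arithmetic in `main_features` and `control_features` is made up, and all names here are hypothetical. The only point is the shape of the computation, where a parallel branch turns the control image into features that are added to the main model's features at every step.

```python
import random


def main_features(latent, text_embedding):
    # Stand-in for the main model: pulls the latent toward the text embedding.
    return [0.5 * z + 0.5 * t for z, t in zip(latent, text_embedding)]


def control_features(control_image):
    # Stand-in for the parallel control network: control image -> feature map.
    return [0.5 * c for c in control_image]


def denoise(text_embedding, control_image, steps=10, seed=0):
    rng = random.Random(seed)
    # Random noise: the starting point, different for every seed.
    latent = [rng.gauss(0.0, 1.0) for _ in text_embedding]
    for _ in range(steps):
        features = main_features(latent, text_embedding)
        injected = control_features(control_image)
        # Injection: control features are added to the main features.
        latent = [f + c for f, c in zip(features, injected)]
    return latent


text = [1.0, 2.0]
without_control = denoise(text, [0.0, 0.0], seed=1)
with_control = denoise(text, [4.0, -4.0], seed=1)
print(without_control)
print(with_control)
```

Both runs share the same seed and text, so any difference between the two outputs comes from the control signal alone. Changing the seed changes the starting noise and therefore the specific result.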
Control Strength
Like reference weight, control strength is a slider:
- Low (0.2–0.4): The control image is a suggestion. The model may deviate for artistic reasons.
- Medium (0.5–0.7): Strong guidance. Structure is clearly followed but the model has room for natural interpretation.
- High (0.8–1.0): Strict adherence. The output closely follows the control signal, but it can look stiff or unnatural if the control image itself is imperfect.
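One way to picture the slider is as a multiplier on whatever the control branch contributes. The sketch below assumes that model: the control contribution is scaled by the strength and then added to the main features. The name `apply_control` and the numbers are illustrative assumptions. In Hugging Face diffusers, the equivalent setting on the ControlNet pipelines is the `controlnet_conditioning_scale` argument.

```python
def apply_control(main, control_residual, strength):
    """Add the control residual to the main features, scaled by strength."""
    return [m + strength * r for m, r in zip(main, control_residual)]


main = [1.0, 1.0, 1.0]       # what the text-guided model would produce alone
residual = [2.0, 0.0, -2.0]  # what the control branch wants to change

# One value from each band: low, medium, high.
for strength in (0.3, 0.6, 0.9):
    print(strength, apply_control(main, residual, strength))
```

At strength 0.0 the control image has no effect, and at 1.0 its full contribution is applied. Values in between trade structural fidelity against the freedom the model keeps.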
Key Takeaways
- ControlNet adds structural conditioning (edges, depth, pose) alongside text conditioning.
- Different control types guide different aspects: edges for shape, depth for 3D arrangement, pose for body positioning.
- Control strength balances structural fidelity against creative freedom.
- Multiple control signals can be stacked for multi-dimensional guidance.
- For spatial precision, a rough control image usually gets you further than dozens of prompt-only iterations.
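Stacking can be sketched the same way as a single control. The example assumes the signals are combined by summing each one's contribution, scaled by its own strength, which is a common approach but not the only possible one. The function `stack_controls` and the depth and pose values are hypothetical.

```python
def stack_controls(main, controls):
    """Apply several control signals in turn.

    controls is a list of (residual, strength) pairs, one per control type.
    """
    out = list(main)
    for residual, strength in controls:
        out = [o + strength * r for o, r in zip(out, residual)]
    return out


main = [0.0, 0.0, 0.0, 0.0]
depth_residual = [1.0, 0.0, 0.0, 0.0]  # e.g. guidance from a depth map
pose_residual = [0.0, 0.0, 2.0, 0.0]   # e.g. guidance from a pose skeleton

combined = stack_controls(main, [(depth_residual, 0.5), (pose_residual, 1.0)])
print(combined)
```

Giving each signal its own strength lets you follow the pose strictly while treating the depth map as a looser suggestion, or the reverse.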