Pipa's one-line summary: Pro audio has 4 layers: dialogue (top) → SFX → ambience → music (bottom, ducked). Even when you use native audio, the standard practice is to swap out individual layers.
Mental model: A film's audio is never "one sound." It's a carefully constructed stack of independent layers, just like a Photoshop file has layers for background, subject, and text. Sound designers work with four primary audio layers, each serving a different purpose. Understanding this layered approach is essential for creating professional generative media.
The Four Audio Layers
Layer 1: Dialogue / Voice
The most important layer: listeners attend to speech before any other sound. Dialogue must be clear, well-timed, and emotionally appropriate. In generative media, it comes from TTS, voice cloning, or the native audio of a video model.
- Always sits "on top" of the mix — other layers support it, never overpower it.
- Needs clean recording quality (no background noise artifacts).
- Timing relative to visuals is critical (see Lesson 5 on lip sync).
Layer 2: Sound Effects (SFX)
Specific, identifiable sounds tied to visible events: footsteps, doors, typing, glass clinking, explosions. They make visual events feel real and impactful.
- Must be synchronized to visual events (the sound plays at the exact moment the door closes).
- Subtle SFX (cloth rustling, keyboard typing) add realism that viewers feel but don't consciously notice.
- Can be generated, sourced from libraries, or extracted from native audio.
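The sync requirement above comes down to converting a video frame number into an audio sample offset. The following is a minimal sketch, assuming a constant frame rate and mono audio held as a plain list of floats; `frame_to_sample`, `place_clip`, the frame number, and the three-sample "door slam" are all hypothetical illustrations, not part of any particular tool.

```python
def frame_to_sample(frame_index: int, fps: float, sample_rate: int) -> int:
    """Return the audio sample index at which a video frame begins."""
    return round(frame_index / fps * sample_rate)


def place_clip(track: list[float], clip: list[float], start: int) -> None:
    """Add `clip` into `track` in place, starting at sample index `start`.
    Samples that would fall outside the track are dropped."""
    for i, value in enumerate(clip):
        if 0 <= start + i < len(track):
            track[start + i] += value


# Hypothetical example: a door closes on frame 48 of a 24 fps shot.
SAMPLE_RATE = 48_000
sfx_track = [0.0] * (3 * SAMPLE_RATE)   # 3 seconds of silence
door_slam = [0.8, 0.5, 0.2]             # stand-in for real SFX samples
start = frame_to_sample(48, fps=24, sample_rate=SAMPLE_RATE)
place_clip(sfx_track, door_slam, start)
```

Here frame 48 at 24 fps is the 2-second mark, so the clip lands at sample 96,000 of the SFX track.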
Layer 3: Ambience / Room Tone
The background atmosphere of a scene: café chatter, forest insects, city traffic, air conditioning hum. Ambience establishes "where you are" and fills the sonic space between events.
- Should be continuous and consistent — cutting abruptly breaks immersion.
- Changes between scenes signal location shifts to the viewer.
- Often the most overlooked layer, but its absence creates an eerie "vacuum" feeling.
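One common way to avoid an abrupt ambience cut at a scene change is to crossfade the outgoing bed into the incoming one. The sketch below uses a simple linear crossfade on plain lists of samples; the function name `crossfade` and the toy sample values are assumptions for illustration. Equal-power curves are often preferred for real material, but the linear version shows the idea.

```python
def crossfade(outgoing: list[float], incoming: list[float], overlap: int) -> list[float]:
    """Join two ambience beds, blending the last `overlap` samples of the
    outgoing bed with the first `overlap` samples of the incoming bed."""
    if overlap <= 0 or overlap > min(len(outgoing), len(incoming)):
        raise ValueError("overlap must be positive and fit inside both clips")
    head = outgoing[:-overlap]            # untouched part of the old scene
    tail = incoming[overlap:]             # untouched part of the new scene
    blended = []
    for i in range(overlap):
        t = (i + 1) / (overlap + 1)       # fade position, strictly between 0 and 1
        old = outgoing[len(outgoing) - overlap + i]
        blended.append(old * (1 - t) + incoming[i] * t)
    return head + blended + tail


# Toy data: a constant "cafe" bed fading into a silent "street" bed.
cafe = [1.0] * 6
street = [0.0] * 6
joined = crossfade(cafe, street, overlap=3)
```

The joined track is shorter than the two clips laid end to end, because the overlapping region is shared between them.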
Layer 4: Music / Score
Background music sets emotional tone: tension, joy, wonder, nostalgia. It guides the viewer's emotional response independently of visuals.
- Should complement dialogue, not compete with it.
- Volume ducking: music automatically drops during dialogue and rises during visual-only moments.
- Style should match the visual aesthetic (cinematic visuals + lo-fi hip-hop = mismatch).
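Volume ducking can be sketched as a gain curve applied to the music layer. The example below assumes the dialogue timing is already known as a list of (start, end) intervals in seconds; the function name `duck_gain`, the ducked level of 0.25 (about -12 dB), and the 0.3-second ramp are illustrative choices, not values from any standard.

```python
def duck_gain(t: float, speech: list[tuple[float, float]],
              ducked: float = 0.25, ramp: float = 0.3) -> float:
    """Music gain at time `t` (seconds). Full level is 1.0. Inside a speech
    interval the gain is `ducked`; within `ramp` seconds before or after
    one, the gain moves linearly between the two levels."""
    gain = 1.0
    for start, end in speech:
        if start <= t <= end:
            return ducked
        # Distance from t to the nearest edge of this speech interval.
        gap = start - t if t < start else t - end
        if gap < ramp:
            gain = min(gain, ducked + (1.0 - ducked) * gap / ramp)
    return gain


# Hypothetical timing: one line of dialogue from 2.0 s to 5.0 s.
speech = [(2.0, 5.0)]
before = duck_gain(0.0, speech)    # music at full level
during = duck_gain(3.0, speech)    # music ducked under the voice
fading = duck_gain(1.85, speech)   # halfway down the ramp
```

Multiplying each music sample by `duck_gain` at its timestamp gives the ducked layer. A production mix would more likely use a sidechain compressor or keyframed automation, but the resulting gain shape is similar.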
The Audio Layer Stack:

Priority:     ┌──────────────────────────────────────────┐
Highest       │ Layer 1: DIALOGUE / VOICE                │
(always       │ "Welcome to our product demo..."         │
on top)       ├──────────────────────────────────────────┤
              │ Layer 2: SOUND EFFECTS                   │
Synced to     │ [click] [whoosh] [typing]                │
visual events ├──────────────────────────────────────────┤
              │ Layer 3: AMBIENCE                        │
Continuous    │ ~~~office hum, distant traffic~~~        │
background    ├──────────────────────────────────────────┤
Lowest        │ Layer 4: MUSIC / SCORE                   │
(ducks under  │ ♫ gentle corporate background ♫          │
dialogue)     └──────────────────────────────────────────┘
Why Layers Matter for Generative Media
When you generate everything in one pass (native audio), you get all layers mixed together — convenient but inflexible. If the voice is perfect but the music is wrong, you can't change just the music without regenerating everything.
When you generate layers separately, you have full mix control: adjust dialogue volume, swap out the music, time sound effects precisely, change ambient atmosphere. The tradeoff is more work for more control.
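The control described above can be illustrated with a small mixer: each layer is kept as its own track and combined only at the end, so one layer can be replaced without touching the others. This is a minimal sketch with made-up sample values; the `mix` function, the layer names used as dictionary keys, and the gain values are assumptions for illustration.

```python
def mix(layers: dict[str, list[float]], gains: dict[str, float]) -> list[float]:
    """Sum equal-length layers sample by sample, scaling each by its gain,
    then clamp to [-1.0, 1.0] as a crude guard against clipping."""
    length = len(next(iter(layers.values())))
    if any(len(samples) != length for samples in layers.values()):
        raise ValueError("all layers must have the same length")
    out = [0.0] * length
    for name, samples in layers.items():
        g = gains.get(name, 1.0)          # layers without a gain pass through
        for i, s in enumerate(samples):
            out[i] += g * s
    return [max(-1.0, min(1.0, s)) for s in out]


layers = {
    "dialogue": [0.5, 0.5, 0.0, 0.0],
    "sfx":      [0.0, 0.25, 0.0, 0.0],
    "ambience": [0.125, 0.125, 0.125, 0.125],
    "music":    [0.5, 0.5, 0.5, 0.5],
}
gains = {"dialogue": 1.0, "sfx": 1.0, "ambience": 1.0, "music": 0.25}
first_mix = mix(layers, gains)

# Swap only the music layer; the other three layers are reused untouched.
layers["music"] = [0.25, 0.25, 0.25, 0.25]
second_mix = mix(layers, gains)
```

With a single-pass native audio track there is only one list of samples, so the equivalent change would mean regenerating the whole clip.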
Key Takeaways
- Professional audio has four layers: dialogue, sound effects, ambience, and music.
- Dialogue sits on top; music on the bottom. SFX sync to events; ambience fills the space.
- Generating layers separately gives maximum control; native audio gives maximum convenience.
- Even with native audio, plan to replace or adjust individual layers in post-production.