Pippa's one-line summary: Generative media is no longer silent. Veo 3 (2025) began generating video, dialogue, SFX, and music from a single prompt, and that is redefining what prompt-to-media means.
Mental model: Think about the last time you watched a video on mute versus with sound. The muted version felt flat, disconnected, almost lifeless — even if the visuals were beautiful. Sound is half the experience. A door opening means nothing until you hear the creak. A rainstorm feels fake until you hear the patter. Generative media is now crossing the same threshold: it's not just about pixels anymore, it's about pixels plus sound waves.
Why Audio Matters for Generative Media
For years, image and video generation existed in a silent vacuum. You'd generate a beautiful video clip and then scramble to find royalty-free music, record a voiceover, or add sound effects manually. That workflow is changing rapidly because:
- Native audio in video models: Google's Veo 3 (launched May 2025) was among the first major models to generate synchronized audio alongside video: dialogue, ambient sound, and background music, all from a text prompt.
- Standalone voice generation: ElevenLabs and OpenAI TTS have made high-quality voice synthesis accessible, affordable, and fast. ElevenLabs, for example, offers voice cloning from roughly 1 minute of audio, emotional delivery, and 30+ languages.
- Sound effect generation: Dedicated models can now generate specific sound effects (footsteps, thunder, machine hum) from text descriptions.
- Music generation: Models like Suno and Udio generate full songs with lyrics, instruments, and production from text prompts.
The Generative Media Stack (2026):
┌─────────────────────────────────────────────────┐
│                  FINAL OUTPUT                   │
│        (video + dialogue + SFX + music)         │
└───────────────────┬─────────────────────────────┘
                    │
┌───────┬───────┬───┴───┬─────────┐
│ Video │ Voice │ Sound │ Music   │
│ Gen   │ Gen   │ FX    │ Gen     │
│       │       │ Gen   │         │
└───┬───┴───┬───┴───┬───┴────┬────┘
    │       │       │        │
Separate models OR unified multimodal model
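The two routes in the stack can be sketched as a small planner. This is only an illustration under stated assumptions, not any vendor's API: `Modality`, `Stage`, and `plan_pipeline` are hypothetical names, and the model labels are placeholders rather than real model identifiers.

```python
from dataclasses import dataclass
from enum import Enum


class Modality(Enum):
    VIDEO = "video"
    VOICE = "voice"
    SFX = "sfx"
    MUSIC = "music"


@dataclass(frozen=True)
class Stage:
    """One generation step: a (placeholder) model label and what it outputs."""
    model: str
    outputs: tuple


def plan_pipeline(needed, unified_available):
    """Return the generation stages for the requested modalities.

    needed: set of Modality values the final output should contain.
    unified_available: True if one multimodal model can cover all of them.
    """
    ordered = tuple(m for m in Modality if m in needed)  # stable enum order
    if unified_available:
        # Unified route: a single model call yields every requested modality.
        return [Stage(model="unified-multimodal", outputs=ordered)]
    # Separate route: one specialised model per modality, combined afterwards.
    return [Stage(model=f"{m.value}-generator", outputs=(m,)) for m in ordered]


everything = set(Modality)
unified = plan_pipeline(everything, unified_available=True)
separate = plan_pipeline(everything, unified_available=False)
print(len(unified), len(separate))
```

With all four modalities requested, the unified route plans one stage and the separate route plans four, which is the trade-off the diagram's last line points at: fewer moving parts versus per-layer control.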
The Shift From Silent to Multimodal
This shift matters because it changes what a "prompt" can produce. Two years ago, a prompt like "a barista steams milk in a busy coffee shop" would generate a silent video clip. Today, with the right model, the same prompt can generate the video plus the hissing of steam, the clinking of cups, background chatter, and perhaps a jazz track playing softly. The prompt-to-media pipeline is becoming truly multimodal.
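To make the barista example concrete, here is a minimal sketch of how the audio layers might be spelled out in one prompt string. `ScenePrompt` and its "Sound effects:" / "Ambient sound:" / "Music:" labels are hypothetical conventions chosen for illustration; no model is known to require this exact format, and a plain descriptive sentence may work just as well.

```python
from dataclasses import dataclass, field


@dataclass
class ScenePrompt:
    """Hypothetical container for one scene's visual and audio cues."""
    visual: str
    sfx: list = field(default_factory=list)
    ambience: list = field(default_factory=list)
    music: str = ""

    def to_text(self):
        """Join the cues into a single natural-language prompt string."""
        parts = [self.visual.rstrip(".") + "."]
        if self.sfx:
            parts.append("Sound effects: " + ", ".join(self.sfx) + ".")
        if self.ambience:
            parts.append("Ambient sound: " + ", ".join(self.ambience) + ".")
        if self.music:
            parts.append("Music: " + self.music + ".")
        return " ".join(parts)


scene = ScenePrompt(
    visual="A barista steams milk in a busy coffee shop",
    sfx=["hissing steam", "clinking cups"],
    ambience=["background chatter"],
    music="a soft jazz track",
)
prompt_text = scene.to_text()
print(prompt_text)
```

Keeping the layers as separate fields makes it easy to drop one (say, the music) when the model's native audio is replaced by a dedicated generator for that layer.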
- Modern generative media includes audio: voice, sound effects, music, and ambient sound.
- Native audio in video models (Veo 3) eliminates the need for separate audio post-production in some workflows.
- Standalone voice (ElevenLabs, OpenAI TTS), sound effects, and music generation are each mature enough to use in many production workflows.
- The prompt-to-media pipeline is becoming truly multimodal — a single prompt can produce video + audio.