Lesson 01 of 10 · published

Media Generation Now Includes Sound

~12 min · audio, voice, l1

Pippa's one-line summary: Generative media is no longer silent. Veo 3 (2025) began producing video + dialogue + SFX + music from a single prompt, and that is changing the definition of prompt-to-media.

Mental model: Think about the last time you watched a video on mute versus with sound. The muted version felt flat, disconnected, almost lifeless — even if the visuals were beautiful. Sound is half the experience. A door opening means nothing until you hear the creak. A rainstorm feels fake until you hear the patter. Generative media is now crossing the same threshold: it's not just about pixels anymore, it's about pixels plus sound waves.

Why Audio Matters for Generative Media

For years, image and video generation existed in a silent vacuum. You'd generate a beautiful video clip and then scramble to find royalty-free music, record a voiceover, or add sound effects manually. That workflow is changing rapidly because:

  • Native audio in video models: Google's Veo 3 (launched May 2025) was among the first major models to generate synchronized audio alongside video: dialogue, ambient sound, and background music, all from a text prompt.
  • Standalone voice generation: ElevenLabs and OpenAI TTS have made high-quality voice synthesis accessible, affordable, and fast. ElevenLabs, for example, advertises voice cloning from about a minute of audio, emotional delivery, and 30+ languages.
  • Sound effect generation: Dedicated models can now generate specific sound effects (footsteps, thunder, machine hum) from text descriptions.
  • Music generation: Models like Suno and Udio generate full songs with lyrics, instruments, and production from text prompts.

The Generative Media Stack (2026):

  ┌───────────────────────────────────┐
  │           FINAL OUTPUT            │
  │ (video + dialogue + SFX + music)  │
  └─────────────────┬─────────────────┘
                    │
  ┌────────┬────────┼────────┬────────┐
  │ Video  │ Voice  │ Sound  │ Music  │
  │ Gen    │ Gen    │ FX Gen │ Gen    │
  └───┬────┴───┬────┴───┬────┴───┬────┘
      │        │        │        │
  Separate models OR unified multimodal model
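
The "separate models OR unified multimodal model" choice in the stack above can be sketched as a small planning step: for each layer of the final output, pick a model that can produce it. This is only an illustrative sketch. The `Model` class, the `plan_pipeline` function, and every model name below are hypothetical placeholders, not real products or APIs.

```python
from dataclasses import dataclass

# The four layers from the stack diagram.
LAYERS = ("video", "voice", "sfx", "music")


@dataclass(frozen=True)
class Model:
    name: str            # hypothetical label, not a real model ID
    produces: frozenset  # which layers this model can output


def plan_pipeline(models, needed=LAYERS):
    """Assign each needed layer to the first listed model that produces it."""
    plan = {}
    for layer in needed:
        for model in models:
            if layer in model.produces:
                plan[layer] = model.name
                break
        else:
            # No model in the list covers this layer.
            raise ValueError(f"no model produces layer: {layer}")
    return plan


# Option A: one unified multimodal model covers every layer.
unified = [Model("unified-av-model", frozenset(LAYERS))]

# Option B: one specialised model per layer.
separate = [
    Model("video-model", frozenset({"video"})),
    Model("voice-model", frozenset({"voice"})),
    Model("sfx-model", frozenset({"sfx"})),
    Model("music-model", frozenset({"music"})),
]

unified_plan = plan_pipeline(unified)
separate_plan = plan_pipeline(separate)
```

With the unified list, every layer maps to the same model; with the separate list, each layer maps to its own model and the outputs still have to be mixed together afterwards.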

The Shift From Silent to Multimodal

This shift matters because it changes what a "prompt" can produce. Two years ago, a prompt like "a barista steams milk in a busy coffee shop" would generate a silent video clip. Today, with the right model, the same prompt can generate the video plus the hissing of steam, the clinking of cups, background chatter, and perhaps a jazz track playing softly. The prompt-to-media pipeline is becoming truly multimodal.
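
One way to make the barista example concrete is to write the prompt as explicit layers and then flatten it into a single text prompt. The sketch below is a hypothetical helper for organising your own prompts. The `ScenePrompt` class and the "Sound effects:" / "Ambience:" / "Music:" labels are assumptions for illustration, not a format that Veo 3 or any other model requires.

```python
from dataclasses import dataclass


@dataclass
class ScenePrompt:
    """A scene described as one visual layer plus optional audio layers."""
    visual: str
    dialogue: str = ""
    sfx: tuple = ()
    ambience: str = ""
    music: str = ""

    def to_prompt(self) -> str:
        # Start with the visual description, then append only the
        # audio layers that were actually filled in.
        parts = [self.visual]
        if self.dialogue:
            parts.append(f"Dialogue: {self.dialogue}")
        if self.sfx:
            parts.append("Sound effects: " + ", ".join(self.sfx))
        if self.ambience:
            parts.append(f"Ambience: {self.ambience}")
        if self.music:
            parts.append(f"Music: {self.music}")
        # Normalise so each part ends with exactly one period.
        return " ".join(p.rstrip(".") + "." for p in parts)


barista = ScenePrompt(
    visual="A barista steams milk in a busy coffee shop",
    sfx=("hissing steam", "clinking cups"),
    ambience="background chatter",
    music="soft jazz playing quietly",
)
prompt_text = barista.to_prompt()
```

Writing the layers out separately also makes it easy to see which ones a silent-only model would simply ignore.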

Key Takeaways
  • Modern generative media includes audio: voice, sound effects, music, and ambient sound.
  • Native audio in video models (Veo 3) eliminates the need for separate audio post-production in some workflows.
  • Standalone voice (ElevenLabs, OpenAI TTS), sound effects, and music generation are all production-ready.
  • The prompt-to-media pipeline is becoming truly multimodal — a single prompt can produce video + audio.

Exercise

Take one 30-second silent video clip. Imagine four audio layers (dialogue, SFX, ambience, music) and write a one-line description of each. Then find or generate at least one of those layers. The video should feel about 50% more alive.
