C.W.K.
Stream
Lesson 03 of 10 · published

TTS, Voice Cloning, and Emotional Delivery

~18 min · audio, voice, l3

Pippa's one-line summary: Voice control has three channels: what is said (text), who says it (voice, cloning), and how it is said (emotion, pacing). The most powerful 'direction tool' is actually the *text itself*.

Mental model: Think of voice generation as a sound studio mixing board with three channels of control. The first channel is what is said (the text). The second is who says it (the voice identity). The third is how it is said (emotion, pacing, emphasis). Each channel is independently adjustable, and the best results come from setting all three deliberately.

Channel 1: What — Text-to-Speech (TTS)

At its most basic, TTS converts written text into spoken audio. Modern TTS has moved far beyond the robotic voices of early GPS navigation. The best current models produce speech that can be hard to tell apart from human recordings.

Key considerations:

  • Pronunciation: Most models handle standard words well, but technical terms, brand names, and unusual words may be mispronounced. Many platforms allow phonetic hints.
  • Punctuation matters: Commas create pauses. Periods create stops. Em-dashes create dramatic pauses. Question marks shift intonation upward. Your punctuation is a delivery script.
  • Length and pacing: Short sentences sound punchy and urgent. Long flowing sentences sound contemplative. The structure of your text directly shapes the vocal performance.
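To make the "punctuation is a delivery script" idea concrete, here is a minimal Python sketch that reads a text the way a director might: it lists each phrase with the punctuation mark that ends it, and counts words per sentence as a rough pacing proxy. The `PAUSE_WEIGHTS` table and both function names are hypothetical, chosen for illustration only. Real TTS engines decide pause lengths internally and do not expose a table like this.

```python
import re

# Hypothetical relative pause weights (assumption for illustration only).
# "\u2014" is the em-dash, which the lesson describes as a dramatic pause.
PAUSE_WEIGHTS = {",": 1, ";": 2, ":": 2, "\u2014": 3, ".": 4, "?": 4, "!": 4}


def delivery_script(text: str) -> list[tuple[str, str, int]]:
    """Split text into (phrase, punctuation, pause_weight) cues."""
    cues = []
    # Each match is a run of non-punctuation text followed by one mark.
    pattern = r"([^,;:\u2014.?!]+)([,;:\u2014.?!])"
    for phrase, mark in re.findall(pattern, text):
        cues.append((phrase.strip(), mark, PAUSE_WEIGHTS[mark]))
    return cues


def words_per_sentence(text: str) -> list[int]:
    """Word count of each sentence: short reads punchy, long reads flowing."""
    sentences = [s for s in re.split(r"[.?!]+", text) if s.strip()]
    return [len(s.split()) for s in sentences]


if __name__ == "__main__":
    for cue in delivery_script("Wait, what? Go."):
        print(cue)
    print(words_per_sentence("Stop. Look at the long winding road ahead."))
```

Running a draft through a check like this before generating audio can show where the text is all long sentences, or where a pause is missing, without spending any TTS credits.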

Channel 2: Who — Voice Identity and Cloning

Voice cloning creates a synthetic version of a specific voice from a sample recording:

  • Instant cloning: 15–60 seconds of sample audio → usable clone. Quick but less accurate.
  • Professional cloning: 5–30 minutes of studio recording → high-fidelity clone. Used for commercial voice actors licensing their voice.

ElevenLabs reports 94% voice similarity in cloning, which suggests that listeners will usually hear the clone as the same speaker. This enables consistent character voices across long projects (audiobooks, series, games) without scheduling the original speaker for every session.
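As a small illustration of the two cloning tiers, the sketch below measures a WAV sample with Python's standard `wave` module and maps its duration onto the ranges quoted above (15 to 60 seconds, 5 to 30 minutes). The function names and the labels for durations outside those ranges are assumptions made for this example. Actual platforms set their own limits, so treat the thresholds as the lesson's rule of thumb rather than any provider's requirement.

```python
import wave


def wav_duration_seconds(path: str) -> float:
    """Duration of a WAV file, computed as frame count over sample rate."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def cloning_tier(sample_seconds: float) -> str:
    """Map a sample length onto the lesson's two cloning tiers.

    Labels other than "instant" and "professional" are hypothetical
    names for durations the lesson does not cover.
    """
    if sample_seconds < 15:
        return "too short"
    if sample_seconds <= 60:
        return "instant"            # 15-60 seconds
    if sample_seconds < 5 * 60:
        return "between tiers"
    if sample_seconds <= 30 * 60:
        return "professional"       # 5-30 minutes
    return "longer than needed"
```

For example, `cloning_tier(wav_duration_seconds("sample.wav"))` would classify a recording before it is uploaded, where `sample.wav` is a placeholder file name.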

Channel 3: How — Emotional Delivery and Style Direction

The most nuanced and challenging channel. Telling a voice model to "sound excited" or "speak sadly" requires the model to adjust multiple vocal parameters simultaneously:

  • Prosody: The rhythm, stress, and intonation of speech. Happy speech tends to have wider pitch variation and a faster tempo. Sad speech tends to be flatter and slower.
  • Emphasis: Which words are stressed. "I didn't say HE stole it" vs. "I didn't say he STOLE it" — same words, different meanings.
  • Pacing: Speed of delivery, placement of pauses, breath patterns.
  • Vocal quality: Breathy, gravelly, clear, strained, whispered, shouted.
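The emphasis example above can be generated mechanically. This short Python sketch produces one variant of a sentence per word, with that word capitalized, so each reading can be sent to a TTS model and compared. The function name is hypothetical, and whether a given model treats capitals as stress is an assumption to check by listening, since some engines ignore case entirely.

```python
def emphasis_variants(sentence: str) -> list[str]:
    """Return one copy of the sentence per word, with that word in capitals."""
    words = sentence.split()
    variants = []
    for i, word in enumerate(words):
        # Rebuild the sentence with only the i-th word uppercased.
        stressed = words[:i] + [word.upper()] + words[i + 1:]
        variants.append(" ".join(stressed))
    return variants


if __name__ == "__main__":
    for line in emphasis_variants("I didn't say he stole it"):
        print(line)
```

Words that are already uppercase, such as "I", produce a variant identical to the original, so those need a different cue (italics, surrounding pauses, or a style instruction).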

Key Takeaways

  • Voice generation has three control channels: what (text), who (voice identity), how (emotional delivery).
  • Voice cloning reaches a reported ~94% similarity from small samples: powerful, but it demands ethical responsibility.
  • Emotional delivery is controlled through style parameters and through the text structure itself.
  • Punctuation and sentence structure are your primary "direction" tools: write how you want it spoken.

Code

Example code (text)
# ElevenLabs style direction examples:

voice_settings:
  stability: 0.5          # Lower = more expressive variation
  similarity_boost: 0.8   # Higher = closer to reference voice
  style: 0.6              # Higher = more stylistic expression

# OpenAI gpt-4o-mini-tts allows natural language direction:
instructions: "Speak warmly and conversationally, like
  explaining something to a close friend. Use natural
  pauses and emphasis. Slightly faster pace, genuine
  enthusiasm without sounding forced."
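The settings above can be assembled and sanity-checked in code before any request is sent. The Python sketch below only builds a dictionary and makes no network call. The three field names come from the snippet above. The 0 to 1 range check, the function names, and the `{"text": ..., "voice_settings": ...}` envelope are assumptions for illustration and should be checked against the provider's current API reference.

```python
def build_voice_settings(stability: float = 0.5,
                         similarity_boost: float = 0.8,
                         style: float = 0.6) -> dict[str, float]:
    """Collect the three settings, rejecting values outside 0..1 (assumed range)."""
    settings = {
        "stability": stability,                # lower = more expressive variation
        "similarity_boost": similarity_boost,  # higher = closer to reference voice
        "style": style,                        # higher = more stylistic expression
    }
    for name, value in settings.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1, got {value}")
    return settings


def build_request_body(text: str, settings: dict[str, float]) -> dict:
    """Pair the text with its settings; the envelope shape is an assumption."""
    if not text.strip():
        raise ValueError("text must not be empty")
    return {"text": text, "voice_settings": settings}


if __name__ == "__main__":
    expressive = build_voice_settings(stability=0.3, style=0.8)
    print(build_request_body("Okay, so here's the thing.", expressive))
```

Keeping validation separate from the request makes it easy to log exactly which settings produced which audio file.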

Exercise

Write a short paragraph (about 50 words). Generate it with TTS using three different stability/style settings. Document how each parameter changed the delivery. Find the setting that sounds like a friend explaining something.
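One way to keep the exercise organized is a small trial log. The sketch below checks that the paragraph is roughly 50 words and returns one record per setting with an empty notes field to fill in after listening. The three setting pairs in `TRIALS`, the tolerance of 10 words, and the function name are arbitrary starting points for illustration, not recommended values.

```python
# Three hypothetical stability/style pairs, from expressive to steady.
TRIALS = [
    {"stability": 0.3, "style": 0.8},
    {"stability": 0.5, "style": 0.5},
    {"stability": 0.8, "style": 0.2},
]


def plan_exercise(paragraph: str, trials: list[dict],
                  target_words: int = 50, tolerance: int = 10) -> list[dict]:
    """Return one log record per trial, after checking paragraph length."""
    word_count = len(paragraph.split())
    if abs(word_count - target_words) > tolerance:
        raise ValueError(
            f"paragraph has {word_count} words, expected about {target_words}"
        )
    # Number the trials from 1 and leave room for listening notes.
    return [
        {"trial": i + 1, **settings, "notes": ""}
        for i, settings in enumerate(trials)
    ]
```

After generating each clip, fill in `notes` with what changed (pace, pitch range, warmth) so the comparison across the three settings is written down rather than remembered.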
