mlx-audio TTS: Speaking on Apple Silicon

Speaking on Apple Silicon

mlx-audio is MLX's audio extension: text-to-speech, speech-to-text, and a few audio-conditioned utilities. It lives outside the core mlx-lm package, so you install it separately. This lesson covers the text-to-speech side.

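The trade-off to understand first: TTS quality moves fast, and the leading model changes every few months. A code path that worked perfectly with the popular TTS model of six months ago may need a new model name today. The pattern (load a model, synthesize, save) is stable; the specific model identifiers and their voice slots are not. Treat the model names below as a 2026-05 snapshot.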

Install + minimal loop

Two steps: run pip install mlx-audio in your mlx env, then run a Python snippet that loads a TTS model and synthesizes one sentence to a WAV file.

Latency expectations on Apple Silicon

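For small TTS models on an M-class Mac (Kokoro, Sesame, or similar models in roughly the 100M-parameter class), the wall-clock latency of synthesizing a short sentence is typically a fraction of the audio's playback duration; that is, faster than realtime. For longer text or larger models, latency scales with token count and model size, but on an M-Pro or above you should still see better-than-realtime synthesis. If you see slower-than-realtime synthesis, the most likely cause is that the model has not yet been optimized for MLX, or that it is running through a wrapper that does not take advantage of unified memory.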
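One way to check the better-than-realtime expectation is to compute a real-time factor: wall-clock synthesis time divided by the audio's playback duration. The sketch below uses only Python's standard `wave` module and assumes the output file is a PCM-encoded WAV, which is what `wave` can read. The helper names `wav_duration_seconds` and `realtime_factor` are illustrative and are not part of mlx-audio.

```python
import wave


def wav_duration_seconds(path):
    """Return the playback duration of a PCM WAV file, in seconds."""
    with wave.open(path, "rb") as wf:
        # frames / (frames per second) = seconds of audio
        return wf.getnframes() / wf.getframerate()


def realtime_factor(synthesis_seconds, audio_seconds):
    """Wall-clock synthesis time divided by playback duration.

    A value below 1.0 means synthesis ran faster than realtime.
    """
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return synthesis_seconds / audio_seconds
```

To use it, time the synthesis call yourself (for example with `time.perf_counter()`), then pass the elapsed seconds together with `wav_duration_seconds("your_output.wav")`. A factor of 0.25 would mean one second of audio took a quarter of a second to synthesize.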

Voices and prompts

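Most modern TTS models take a voice parameter (chosen from a discrete set of trained voices) and the text to synthesize. Some accept additional conditioning: emotion, speaking rate, or pause control through SSML-style markers. Read the model card for the exact API; mlx-audio's wrappers around the raw models usually expose the same controls but rename them to fit a common API.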
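Because the voice-plus-text shape is shared across models, a comparison across voice slots can be written without committing to any one model's API. The following is a minimal sketch under that assumption: `time_voices` is a hypothetical helper, and `synthesize` stands for any callable you supply that accepts `text` and `voice` keyword arguments (for instance a thin wrapper around whatever synthesis function your installed mlx-audio version exposes).

```python
import time


def time_voices(synthesize, text, voices):
    """Call synthesize(text=..., voice=...) once per voice.

    Returns a dict mapping each voice name to the wall-clock
    seconds its synthesis call took, in the order given.
    """
    timings = {}
    for voice in voices:
        start = time.perf_counter()
        synthesize(text=text, voice=voice)  # supplied by the caller
        timings[voice] = time.perf_counter() - start
    return timings
```

The valid voice names come from the model card of the model you load; the helper only measures how long each call takes, so the first call may also include one-time model loading.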

What to remember from this lesson

Mental model: TTS on MLX is one more API surface that loads, runs, and emits. It has the same shape as mlx-lm's generate, except that the output is audio bytes instead of a string. The specific best-of-class model will change; the loop will not.

Code

Install mlx-audio (one-time)·bash

# In your `mlx` conda env
conda activate mlx
pip install mlx-audio

Synthesize one sentence to a WAV file (sketch; exact model name as of 2026-05)·python

# mlx-audio's high-level API is moving; check `python -m mlx_audio --help`
# for current sub-commands, and the model card on Hugging Face for required params.
# Typical pattern:

from mlx_audio.tts.generate import generate_audio

generate_audio(
    text="Hello from MLX on Apple Silicon.",
    model_path="prince-canuma/Kokoro-82M",   # or whichever current TTS model you choose
    voice="af_heart",                          # voice slot from the model card
    file_prefix="hello_mlx",                   # writes hello_mlx.wav
)

# Output: a WAV file in the current directory.
# Wall-clock time is typically a fraction of the audio's playback length on
# an M-Pro or Ultra; older / lower-tier chips may be at or near realtime.

Exercise

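Install mlx-audio in your mlx env. Read the current README for the recommended TTS model identifier (it may have evolved beyond Kokoro by the time you read this). Synthesize three sentences with three different voice slots of the same model and listen to all three. Then write two sentences on which voice felt most natural and how the wall-clock latency compared with the playback duration.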