mlx-audio STT: listening on Apple Silicon

The other side of TTS, STT (speech-to-text), also lives in mlx-audio. The dominant family is Whisper and its derivatives; mlx-audio ships MLX-optimized loaders for Whisper-class models that take an audio file and return the transcribed text.

The quality story for STT is more mature than for TTS. Whisper-large-v3 (and its Apple-Silicon-friendly distillations) is genuinely strong in most languages, and on M-Pro and above the transcription time is well below realtime: an hour of audio can be transcribed in minutes.

Install + minimal loop

Same install as for TTS (the mlx-audio package). A minimal Python loop loads a Whisper-class model, points it at a WAV/MP3/M4A file, and gets back the text plus, if you want them, per-segment timestamps.

What you get back

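STT usually returns more than just text. The structured response typically includes segments (chunks of audio with start and end times), the detected language, and per-token confidence scores if the model exposes them. Use the segments when you need to align text back to the audio (subtitles, search-by-time); use only the joined text when all you want is the transcript.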
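
To make the segment-versus-joined-text choice concrete, here is a minimal sketch that turns a list of segments into SubRip (SRT) subtitles and, separately, into a plain transcript. The `segments` list is hypothetical sample data shaped like the {start, end, text} entries described above, and the helper names (`srt_timestamp`, `segments_to_srt`, `joined_text`) are illustrative, not part of mlx-audio.

```python
# Hypothetical segments, shaped like the {start, end, text} entries an STT call may return.
segments = [
    {"start": 0.0, "end": 2.5, "text": " Hello and welcome."},
    {"start": 2.5, "end": 6.04, "text": " Today we talk about speech-to-text."},
]


def srt_timestamp(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp: HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segs) -> str:
    """Build SRT cues: index line, time range line, text line, blank separator."""
    blocks = []
    for i, seg in enumerate(segs, start=1):
        time_range = f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}"
        blocks.append(f"{i}\n{time_range}\n{seg['text'].strip()}\n")
    return "\n".join(blocks)


def joined_text(segs) -> str:
    """Drop the timing and keep only the transcript."""
    return " ".join(seg["text"].strip() for seg in segs)


print(segments_to_srt(segments))
print(joined_text(segments))
```

The same segment list feeds both outputs, so keeping the segments around costs nothing even when the first consumer only needs the joined text.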

Realtime factor

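The single most useful number for STT performance is the realtime factor: how many seconds of audio you can transcribe per second of wall-clock time. For Whisper-large MLX models on M-Ultra, expect a realtime factor in the 5-15× range on typical audio. On M-Pro, expect roughly 2-5×. A realtime factor < 1 means you cannot keep up with a live stream; a realtime factor > 1 means you can transcribe live audio with headroom to spare.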
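
The arithmetic behind those ranges is short enough to sketch. The helpers below are illustrative names, not library functions; they convert a realtime factor into the wall-clock time for a batch job and into the fraction of time the machine would be busy on a live stream.

```python
def realtime_factor(audio_sec: float, wall_sec: float) -> float:
    """Seconds of audio transcribed per second of wall-clock time."""
    return audio_sec / wall_sec


def wall_time_for(audio_sec: float, rtf: float) -> float:
    """Wall-clock seconds needed to transcribe `audio_sec` at a given realtime factor."""
    return audio_sec / rtf


def busy_fraction(rtf: float) -> float:
    """Share of wall-clock time spent transcribing a live stream (above 1.0 means falling behind)."""
    return 1.0 / rtf


one_hour = 3600.0
for rtf in (2.0, 5.0, 10.0):
    minutes = wall_time_for(one_hour, rtf) / 60
    print(f"{rtf:>4.1f}x -> {minutes:.0f} min per hour of audio, "
          f"busy {busy_fraction(rtf):.0%} of the time on a live stream")
```

At 2× an hour of audio takes 30 minutes, at 5× it takes 12, and at 10× it takes 6, which is where the "an hour in minutes" framing comes from.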

The diarization gap

STT (Whisper-class) tells you what was said, not who said it. Speaker diarization, which labels each segment with a speaker id, is a separate capability and, as of 2026-05, is not yet first-class in mlx-audio. If you need diarization, the typical workflow is to run STT in mlx-audio for the transcript, run pyannote-audio (or something similar, on PyTorch) for the diarization, and merge the two streams. Worth knowing: on MLX this is not yet something a single tool solves.
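
The merge step is the part neither tool does for you. One simple approach, sketched here under stated assumptions, is to give each transcript segment the speaker whose turn overlaps it the most in time. Both input lists are hypothetical sample data: `stt_segments` stands in for STT output and `speaker_turns` for diarization output already converted to plain dicts. No pyannote-audio or mlx-audio calls are made, and the function names are illustrative.

```python
# Hypothetical STT output: what was said, and when.
stt_segments = [
    {"start": 0.0, "end": 3.0, "text": "Shall we start?"},
    {"start": 3.0, "end": 7.0, "text": "Yes, the agenda is short."},
    {"start": 7.0, "end": 9.0, "text": "Great."},
]

# Hypothetical diarization output: who was speaking, and when.
speaker_turns = [
    {"start": 0.0, "end": 3.2, "speaker": "SPEAKER_00"},
    {"start": 3.2, "end": 6.8, "speaker": "SPEAKER_01"},
    {"start": 6.8, "end": 9.0, "speaker": "SPEAKER_00"},
]


def overlap(a_start: float, a_end: float, b_start: float, b_end: float) -> float:
    """Length in seconds of the intersection of two time intervals (0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers(segs, turns):
    """Label each segment with the speaker of the turn it overlaps most, or None."""
    labelled = []
    for seg in segs:
        best_speaker, best_overlap = None, 0.0
        for turn in turns:
            shared = overlap(seg["start"], seg["end"], turn["start"], turn["end"])
            if shared > best_overlap:
                best_speaker, best_overlap = turn["speaker"], shared
        labelled.append({**seg, "speaker": best_speaker})
    return labelled


for seg in assign_speakers(stt_segments, speaker_turns):
    print(f"[{seg['start']:5.1f}-{seg['end']:5.1f}] {seg['speaker']}: {seg['text']}")
```

Maximum overlap is a deliberately crude rule: a segment that spans a speaker change gets a single label, so for finer results you would split on word-level timestamps if your model provides them.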

Code

Transcribe an audio file with a Whisper-class MLX model·python

# Pattern (exact API surface depends on mlx-audio version):
from mlx_audio.stt.generate import generate as transcribe

result = transcribe(
    audio_path="path/to/audio.wav",
    model_path="mlx-community/whisper-large-v3-mlx",   # or current best-of-class STT model
    output_path="transcript.txt",
)

# `result` typically exposes the fields below (attribute access is assumed here; some versions may return a plain dict):
#   .text       — full transcribed text
#   .segments   — list of {start, end, text} per segment
#   .language   — auto-detected language code (e.g. 'en', 'ko')
print(result.text[:500])

Measure the realtime factor on your machine·python

import time, wave
from mlx_audio.stt.generate import generate as transcribe

audio_path = "path/to/audio.wav"

# Total audio duration via wave module (works for WAV; for MP3/M4A use ffprobe)
with wave.open(audio_path, "rb") as wf:
    duration_sec = wf.getnframes() / float(wf.getframerate())

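# Time the whole call. A first run may also include model download/load time,
# so repeat the measurement if you want a steady-state number.
t0 = time.perf_counter()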
result = transcribe(audio_path=audio_path, model_path="mlx-community/whisper-large-v3-mlx")
elapsed = time.perf_counter() - t0

print(f"audio duration : {duration_sec:.1f} s")
print(f"transcribe wall : {elapsed:.1f} s")
print(f"realtime factor : {duration_sec / elapsed:.1f}x")

# Sample expected (M3 Ultra Studio with whisper-large-v3-mlx):
#   realtime factor : ~10x   (varies with audio difficulty and model)

Exercise

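Pick a representative audio sample from your real workflow (a podcast you listen to, a meeting recording, anything 5-15 minutes long). Run STT on it together with the realtime-factor measurement. Compare the result with your intuition: is it better or worse than you expected? If you have a multilingual or noisy sample, run that too and note how both the realtime factor and the quality change. Write two sentences on what capacity you would plan for.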