MLX profiling: finding the slow op

Find the slow op, fix it, measure again

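When MLX inference is slower than you'd like, the answer is almost never "the framework is slow." It is almost always "a specific operation in a specific workload is the bottleneck, and you didn't know which one until you measured." This lesson is a practical profiling toolkit.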

The two-tier approach

Tier 1: MLX-level profiling. mlx-lm and mlx itself expose enough metadata to tell you where time goes within MLX's view of the world: generation_tps, peak_memory, and per-eval timing. Use it to identify which call is slow.

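Tier 2: Metal-level profiling. When MLX-level data isn't enough (for example, you've narrowed the problem down to one matmul and want to know which Metal kernel is slow), Xcode's Metal Debugger gives you a GPU-level trace. The setup is heavier, but it is the only path when the bottleneck sits below MLX's API surface.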

Tier 1: built-in mlx-lm metadata

The GenerationResponse objects yielded by stream_generate (covered in lm.lesson2) carry generation_tps and peak_memory. Log them in production and you get a throughput and memory trace for free, without a separate observability stack.
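A minimal sketch of that kind of logging, assuming the final response exposes the fields used in the Code section below (generation_tokens, generation_tps, peak_memory, finish_reason). The helper name log_generation_stats and the JSON-lines layout are illustrative choices, not part of mlx-lm, and the stand-in values are made up.

```python
import json
import sys
import time
from types import SimpleNamespace


def log_generation_stats(response, stream=sys.stderr, **extra):
    """Write one JSON line with the throughput and memory fields of a response.

    `response` is assumed to look like the last object yielded by
    stream_generate; missing attributes are logged as null instead of raising.
    """
    record = {
        "ts": time.time(),
        "generation_tokens": getattr(response, "generation_tokens", None),
        "generation_tps": getattr(response, "generation_tps", None),
        "peak_memory": getattr(response, "peak_memory", None),
        "finish_reason": getattr(response, "finish_reason", None),
    }
    record.update(extra)  # e.g. a request id or model name supplied by the caller
    stream.write(json.dumps(record) + "\n")
    return record


# Stand-in object so the sketch runs without loading a model (values are made up).
fake = SimpleNamespace(generation_tokens=100, generation_tps=650.0,
                       peak_memory=0.685, finish_reason="length")
rec = log_generation_stats(fake, model="demo")
```

In a real service you would call the helper once per request with the last chunk from the stream_generate loop, and point stream at whatever sink your logs already go to.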

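For non-LLM workloads, wrap the suspect MLX call in time.perf_counter(), with mx.eval() at the boundaries. MLX evaluates lazily, so without mx.eval() you time graph construction rather than the computation. Don't measure with print() in a hot loop: terminal I/O contaminates the measurement (covered in core.lesson7).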

Tier 2: Xcode Metal Debugger

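While the MLX workload is running, open Xcode → Debug → Capture GPU Frame. The capture shows each Metal kernel dispatch with its timing. The output is dense (you see kernels named after the MLX operations that produced them), but identifying the longest-running kernel almost always points to the right hot spot.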

This is the right tool when you're trying to understand why a specific MLX operation is slow at the kernel level. For most users it's overkill; learning Tier 1 is enough.

Fix-and-re-measure loop

Profiling without action is debugging tourism. Once you've identified the slow op, the fix usually falls into one of these:

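Use mx.compile on hot functions (core.lesson7). Often the cheapest 2-5× speedup available.
Quantize, if you haven't already (model size drives memory-bandwidth demand, which is often the real bottleneck on Apple Silicon).
Reshape the computation to match MLX's strengths: fewer, larger ops fuse better than many small ones.
Replace CPU fallbacks if you find any (rare in MLX, common in PyTorch MPS).
Upgrade MLX. The framework moves fast and kernel improvements ship every few weeks.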
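The quantization item rests on simple arithmetic that is worth seeing once. The sketch below assumes decoding is dominated by streaming the weights from memory once per generated token; it ignores KV cache traffic, activations, and quantization scales, so treat the result as a rough ceiling rather than a prediction. The function name, the 1B-parameter model, and the 400 GB/s figure are all hypothetical.

```python
def bandwidth_bound_tps(n_params, bits_per_weight, bandwidth_gb_s):
    """Rough upper bound on decode tokens/s when weight reads dominate.

    Assumes every generated token streams all weights from memory once and
    ignores KV cache traffic, activations, and quantization scales.
    """
    weight_bytes = n_params * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / weight_bytes


# Hypothetical 1B-parameter model on a hypothetical 400 GB/s machine.
for bits in (16, 8, 4):
    tps = bandwidth_bound_tps(1e9, bits, 400)
    print(f"{bits:>2}-bit weights: at most ~{tps:.0f} tokens/s")
```

If your measured generation_tps is already close to this ceiling, further kernel tuning is unlikely to help much; shrinking the bytes read per token is the lever that remains.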

Then re-profile to confirm the fix. Don't skip the re-profile step; intuition about what is faster is not reliable.
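One way to make the re-profile step concrete is to compare per-call timings from before and after the change instead of eyeballing a single run. This is a sketch only: compare_timings is a hypothetical helper, the 5% noise threshold is an arbitrary assumption, and the timings are made up. It uses medians so that one cold-start outlier does not decide the verdict.

```python
import statistics


def compare_timings(before_ms, after_ms, min_speedup=1.05):
    """Compare two lists of per-call timings (ms) using their medians.

    Returns (speedup, verdict). min_speedup is an arbitrary noise threshold:
    changes smaller than it are reported as inconclusive.
    """
    before = statistics.median(before_ms)
    after = statistics.median(after_ms)
    speedup = before / after
    if speedup >= min_speedup:
        verdict = "faster"
    elif speedup <= 1 / min_speedup:
        verdict = "slower"
    else:
        verdict = "inconclusive"
    return speedup, verdict


# Made-up timings for illustration only.
baseline = [12.1, 11.9, 12.4, 12.0, 30.5]   # one outlier, e.g. a cold start
candidate = [6.2, 6.0, 6.1, 6.3, 6.1]
speedup, verdict = compare_timings(baseline, candidate)
print(f"{speedup:.2f}x ({verdict})")
```

Collect both lists with the same benchmark harness, on the same machine, under the same load; otherwise the comparison measures the environment rather than the fix.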

Code

Tier 1: log generation_tps and peak_memory in production·python

from mlx_lm import load, stream_generate
import time

model, tok = load("mlx-community/Llama-3.2-1B-Instruct-4bit")

last = None
t0 = time.perf_counter()
for chunk in stream_generate(model, tok, prompt="Tell me about MLX:", max_tokens=100):
    last = chunk

elapsed = time.perf_counter() - t0
print(f"Generated {last.generation_tokens} tokens in {elapsed:.2f}s")
print(f"  generation_tps : {last.generation_tps:.1f}")
# mlx-lm reports peak_memory in GB (check your installed version), so scale to MB.
print(f"  peak_memory MB : {last.peak_memory * 1000:.1f}")
print(f"  finish_reason  : {last.finish_reason}")

# Sample on M3 Ultra Studio with Llama-3.2-1B Q4 (verified 2026-05-03):
#   generation_tps : ~600-800
#   peak_memory MB : ~685

Tier 1: time arbitrary MLX calls accurately·python

import mlx.core as mx
import time

x = mx.random.normal((4096, 4096))

def benchmark(fn, *args, n=10):
    # Warm up
    mx.eval(fn(*args))
    t0 = time.perf_counter()
    for _ in range(n):
        r = fn(*args)
        mx.eval(r)
    return (time.perf_counter() - t0) / n * 1000   # ms per call

t = benchmark(lambda x: x @ x.T, x)
print(f"matmul 4096x4096 : {t:.2f} ms / call")

# Don't put print() in the timed loop — terminal I/O contaminates timing.
# Always wrap with mx.eval() to materialize before reading the clock.

Exercise

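Take an MLX workload you actually run. Add Tier 1 profiling: log generation_tps and peak_memory on every generation. Run it under realistic load and capture the numbers. Identify whether the bottleneck is throughput, memory, or something else (for example, preprocessing latency outside MLX). Write two sentences on what to optimize first, based on the numbers.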
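To check the "something else" case, it helps to time the stages around the MLX call as well as the call itself. The sketch below is one possible starting point, not a prescribed solution: StageTimer and the stage names are hypothetical, and time.sleep stands in for real work.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class StageTimer:
    """Accumulate wall-clock seconds per named stage of a request."""

    def __init__(self):
        self.totals = defaultdict(float)

    @contextmanager
    def stage(self, name):
        t0 = time.perf_counter()
        try:
            yield
        finally:
            self.totals[name] += time.perf_counter() - t0

    def shares(self):
        """Return each stage's fraction of the total measured time."""
        total = sum(self.totals.values())
        return {k: v / total for k, v in self.totals.items()} if total else {}


timer = StageTimer()
with timer.stage("preprocess"):
    time.sleep(0.05)   # stand-in for tokenization, retrieval, or image decoding
with timer.stage("generate"):
    time.sleep(0.01)   # stand-in for the stream_generate loop

# Print after the timed stages, largest share first.
for name, share in sorted(timer.shares().items(), key=lambda kv: -kv[1]):
    print(f"{name:<10} {share:6.1%}")
```

If a stage contains raw MLX operations rather than a stream_generate loop, call mx.eval() on the result inside the stage so the lazy computation is charged to the right name.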