Performance & Memory — 뭐가 RAM 을 먹나, 왜 먹나

추론 중 unified memory 를 먹는 세 가지

모델 weight. 큰 상수. 7B Q4 모델은 ~5 GB weight 사용; 70B Q4 는 ~50 GB 사용. foundations.lesson4 의 냅킨 공식으로 계산한 숫자.
KV cache. Context window 의 모든 토큰의 key 와 value 를 들고 있는 자라는 버퍼. Context 길이에 선형, layer 당 곱셈 더해. 32k context 의 7B 모델의 KV cache 는 극단 케이스에서 모델 자체보다 클 수 있어.
Activation memory. 단일 forward pass 중 쓰는 transient 버퍼. 추론에선 cache 보다 작지만, 진짜.

더해서 macOS 가 OS 자체와 다른 앱들 위해 wired memory 예약. prod.lesson2 가 wired-memory 천장 자세히 다뤄. 이 레슨은 네가 통제하는 부분 — KV cache 크기와 측정법 — 에 집중.

맞는 방법으로 메모리 측정

MLX 가 두 도움 되는 원시형 노출 — mx.get_active_memory() (현재 할당된 GPU 메모리 byte) 과 mx.get_peak_memory() (마지막 reset 이후 peak, byte). 생성 주위 이걸 읽으면 모델 + KV cache + activation 이 정확히 얼마 소비했는지 말해줘.

Office Mac 의 1B Q4 demo 모델에 대해, 검증된 숫자:

로드 후 — ~663 MB active
짧은 생성 후 — ~663 MB active
200-토큰 생성 후 — ~663 MB active, ~685 MB peak

1B 모델의 KV cache 는 전형 context 길이에서 메모리 dominate 안 할 만큼 작아. 32k context 의 7B+ 모델로 scale up 하면 KV cache 가 weight 아니라 load-bearing 숫자 됨.

Wired-memory 천장, 짧게

macOS 는 GPU 가 unified memory 의 100% lock 하게 안 둬. 기본 iogpu.wired_limit_mb 가 GPU 의 접근 가능 메모리를 tier-의존 분수로 cap (Mac 모델 사이 다름). 192 GB Mac Studio 에선, 천장이 192 GB 살짝 아래지만 정확히 192 GB 아냐. 모델이 네 냅킨 계산에 fit 되는데 실제 generation 에 안 fit 되면, wired-memory cap 이 보통 이유. prod.lesson2 가 전체 진단.

이 모든 것에 대해 뭐 하나

Weight 와 KV cache 추정 후 unified memory 의 약 70% 에 fit 되는 모델 크기 골라 (foundations.lesson3 의 룰).
긴 context 엔, 더 작은-양자화 모델 써 — fp16 대신 Q4 — cache 위한 공간 남기게.
모델이 "fit" 한다고 선언 전에 mx.get_peak_memory() 로 측정. 냅킨 계산은 ±10% 신뢰; 실제 peak 은 놀랄 수 있음.
천장 hit 하면, system-레벨 조정 손 뻗기 전에 양자화 떨어뜨리거나 context 줄여.

Code

모델 로드와 생성 주위 메모리 측정·python

import mlx.core as mx
from mlx_lm import load, generate


def mem_mb():
    return mx.get_active_memory() / (1024 * 1024)


def peak_mb():
    return mx.get_peak_memory() / (1024 * 1024)


# Reset peak counter and measure baseline
mx.reset_peak_memory()
print(f"baseline                    : active {mem_mb():.1f} MB, peak {peak_mb():.1f} MB")

model, tok = load("mlx-community/Llama-3.2-1B-Instruct-4bit")
print(f"after load                  : active {mem_mb():.1f} MB, peak {peak_mb():.1f} MB")

generate(model, tok, prompt="Hello.", max_tokens=10, verbose=False)
print(f"after short generate (10)   : active {mem_mb():.1f} MB, peak {peak_mb():.1f} MB")

generate(model, tok, prompt="Tell me a story:", max_tokens=200, verbose=False)
print(f"after longer generate (200) : active {mem_mb():.1f} MB, peak {peak_mb():.1f} MB")

# Verified output (2026-05-03, M3 Ultra Studio):
#   baseline                    : active 0.0 MB, peak 0.0 MB
#   after load                  : active 663.0 MB, peak ~675 MB
#   after short generate (10)   : active 663.0 MB, peak ~680 MB
#   after longer generate (200) : active 663.0 MB, peak ~685 MB

네 Mac 의 wired-memory 천장 살피기·bash

# How much unified memory total?
sysctl -n hw.memsize | awk '{ printf "Unified memory: %.1f GB\n", $1/1024/1024/1024 }'

# What's the GPU wired-memory cap (the hard ceiling for MLX kernels)?
sysctl iogpu.wired_limit_mb
# A value of 0 means "system default" — typically a fraction (e.g. ~75%) of total RAM.
# Non-zero values are explicit overrides; raising this is reckless on small Macs.

# Sample (M3 Ultra Studio, 512 GB):
#   Unified memory: 512.0 GB
#   iogpu.wired_limit_mb: 0      (using system default ceiling)

Exercise

네 Mac 에서 같은 1B Q4 모델로 메모리 측정 블록 돌려. 그 다음 모델을 mlx-community/Llama-3.2-3B-Instruct-4bit (또는 cache 된 어떤 3B Q4) 로 바꾸고 다시 돌려. 로드 후 active 메모리와 200-토큰 생성 후 peak 비교. 모델이 대략 3× 더 크니까 비율이 3× 에 가까워야 — 훨씬 더 높으면, KV cache 가 자기 주장 시작. 측정한 거 두 문장.