mlx-lm in 5 minutes: load, prompt, done

Why this lesson is deliberately short

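The fastest way to internalize that mlx-lm works is to watch it work. There's nothing new to install (you already got mlx-lm in core.lesson1). You pull a small instruct model from the trusted mlx-community Hugging Face org and call generate. Three lines of Python (one import plus two calls), one cup of coffee, and real LLM output on your Mac.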

The rest of this track unpacks every piece of what just happened. This lesson is the "wait, that's it?" moment, the one that earns you the right to unpack it.

Pick a small model to demonstrate the loop

This lesson uses mlx-community/Llama-3.2-1B-Instruct-4bit. Why:

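Small (~700 MB): downloads in about a minute or less on a typical broadband connection, and fits in the unified memory of any M-series Mac.
Instruct-tuned: it answers your prompt instead of just continuing the text, so the demo is legible.
From mlx-community: already in MLX format, already quantized, and already vetted by the community curation discussed in foundations.lesson6.
4-bit quantized: the smallest reasonable footprint that doesn't ruin the demo, and it fits the napkin math from foundations.lesson4.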
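
As a rough check on the ~700 MB figure, here's a napkin-math sketch. The inputs are assumptions, not numbers from this lesson: roughly 1.24 billion parameters for Llama 3.2 1B, and MLX-style 4-bit quantization with a group size of 64, where each group also stores a 16-bit scale and a 16-bit bias. It also pretends every weight is quantized, which isn't quite true, so treat the result as a ballpark.

```python
# Napkin math for the on-disk size of a 4-bit quantized model.
# All constants below are assumptions for illustration.
PARAMS = 1.24e9                 # approx. parameter count of Llama 3.2 1B
BITS = 4                        # bits per quantized weight
GROUP_SIZE = 64                 # weights that share one scale and one bias
OVERHEAD_BITS_PER_GROUP = 32    # 16-bit scale + 16-bit bias per group


def bits_per_weight(bits, group_size, overhead_bits):
    """Effective bits per weight once per-group metadata is spread out."""
    return bits + overhead_bits / group_size


def quantized_size_bytes(params, bits, group_size, overhead_bits):
    """Approximate size in bytes if every weight were quantized."""
    return params * bits_per_weight(bits, group_size, overhead_bits) / 8


effective = bits_per_weight(BITS, GROUP_SIZE, OVERHEAD_BITS_PER_GROUP)
size = quantized_size_bytes(PARAMS, BITS, GROUP_SIZE, OVERHEAD_BITS_PER_GROUP)
print(f"effective bits per weight: {effective}")
print(f"approx size: {size / 1e6:.0f} MB")
```

Under these assumptions the estimate lands close to 700 MB, which is why "4-bit" in practice costs a little more than 4 bits per weight.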

The two calls that do it

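The first code block in the Code section below is the entire load-and-generate loop. load() downloads the model on the first call (it's cached afterward, so later runs skip the download and load from disk) and returns a model + tokenizer pair. generate() takes a prompt and returns the completion as a Python string.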

What just happened, briefly

HF download or cache hit: the first call pulls from huggingface.co/mlx-community/Llama-3.2-1B-Instruct-4bit into ~/.cache/huggingface/hub/. Later runs read from disk.
Weights get mmapped into unified memory: the safetensors shards are memory-mapped, and MLX kernels read directly from that mapped region.
Generate runs token by token: internally, generate wraps stream_generate (lesson 2) and concatenates the token text. Each forward pass is a kernel dispatch on the GPU, and the KV cache grows as it goes (lesson 7).
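
The "wraps a stream and concatenates the text" idea from the last step can be sketched without a model. This is a toy, not mlx-lm's actual source: FakeChunk and fake_stream are hypothetical stand-ins, and the only assumption about the real thing is that each streamed item exposes its text as a .text attribute.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List


@dataclass
class FakeChunk:
    """Hypothetical stand-in for one streamed generation step."""
    text: str


def fake_stream(pieces: List[str]) -> Iterator[FakeChunk]:
    """Hypothetical stream: yields one chunk per piece of text."""
    for piece in pieces:
        yield FakeChunk(text=piece)


def collect_text(chunks: Iterable) -> str:
    """Consume a stream of chunks and join their .text into one string."""
    return "".join(chunk.text for chunk in chunks)


result = collect_text(fake_stream(["Hello", ",", " world", "!"]))
print(result)  # prints: Hello, world!
```

With a real model you would pass the iterator returned by stream_generate(model, tokenizer, prompt, max_tokens=30) to collect_text instead of fake_stream, assuming your mlx-lm version yields objects with a .text field. Lesson 2 covers the streaming side properly.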

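That's the whole pipeline. The next six lessons in this track look at each piece in detail: streaming, sampling, architecture, chat templates, the built-in HTTP server, and memory behavior. The point of this lesson is to know the pipeline exists and works before you earn the right to dissect it.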

Code

Two calls: load + generate·python

from mlx_lm import load, generate

model, tokenizer = load("mlx-community/Llama-3.2-1B-Instruct-4bit")
print(generate(model, tokenizer, prompt="Say hello in one short sentence.", max_tokens=30))

# Verified output (2026-05-03, mlx-lm 0.31.3):
#   '"Hello, how are you?" is a simple yet effective greeting that can start
#    a conversation and break the ice.\n\nHere are some more ideas for greetings'
#
# (Your exact output may differ across mlx-lm versions and hardware.
#  Sampling settings and determinism are covered in lesson 3.)

Where the model lives on disk after the first load·bash

# After the first `load()` call, the model is cached here:
ls ~/.cache/huggingface/hub/models--mlx-community--Llama-3.2-1B-Instruct-4bit/snapshots/

# Check disk usage:
du -sh ~/.cache/huggingface/hub/models--mlx-community--Llama-3.2-1B-Instruct-4bit/

# Sample (verified 2026-05-03):
#   ~700 MB on disk for the 1B Q4 instruct variant
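
If you'd rather do the same check from Python, here's a minimal sketch. It assumes the usual Hugging Face cache layout (a models--ORG--NAME folder under the hub cache) and the HF_HUB_CACHE and HF_HOME environment variables. The helper names are made up for this example, and it ignores other overrides such as XDG_CACHE_HOME.

```python
import os
from pathlib import Path


def hub_cache_dir() -> Path:
    """Best-guess hub cache location: HF_HUB_CACHE, else HF_HOME/hub, else the default."""
    if "HF_HUB_CACHE" in os.environ:
        return Path(os.environ["HF_HUB_CACHE"])
    hf_home = os.environ.get("HF_HOME", str(Path.home() / ".cache" / "huggingface"))
    return Path(hf_home) / "hub"


def model_cache_dir(repo_id: str) -> Path:
    """Map 'org/name' to the 'models--org--name' folder inside the hub cache."""
    return hub_cache_dir() / ("models--" + repo_id.replace("/", "--"))


def dir_size_bytes(path: Path) -> int:
    """Sum real file sizes, skipping symlinks so blobs are counted once."""
    return sum(
        p.stat().st_size
        for p in path.rglob("*")
        if not p.is_symlink() and p.is_file()
    )


repo = "mlx-community/Llama-3.2-1B-Instruct-4bit"
cache_dir = model_cache_dir(repo)
if cache_dir.exists():
    print(f"{cache_dir}: {dir_size_bytes(cache_dir) / 1e6:.0f} MB")
else:
    print(f"{cache_dir} not found yet; run load() once first")
```

Skipping symlinks matters because the snapshots folder normally points at files stored under blobs, so following the links would count the weights twice.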

Exercise

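Run the load-and-generate snippet. Confirm a real completion comes back. Then call generate() three more times with the same model object and notice there's no re-download and no reload, just a fresh forward pass each time. Use time.perf_counter() to time the first generation against the third: the first is slightly slower because of MLX's first-call JIT warm-up. Write two sentences on what you noticed.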
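
One possible timing scaffold for the exercise, as a sketch. time_calls and fake_generate are hypothetical names. fake_generate is only there so the snippet runs anywhere, and it won't show any warm-up effect.

```python
import time


def time_calls(fn, n):
    """Call fn() n times and return each call's wall-clock duration in seconds."""
    durations = []
    for _ in range(n):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return durations


def fake_generate():
    """Hypothetical stand-in for a generate() call; just burns a little CPU."""
    return sum(i * i for i in range(100_000))


timings = time_calls(fake_generate, 4)
for i, seconds in enumerate(timings, start=1):
    print(f"call {i}: {seconds * 1000:.2f} ms")
```

For the real exercise, swap fake_generate for something like lambda: generate(model, tokenizer, prompt="Say hello in one short sentence.", max_tokens=30), with model and tokenizer loaded once beforehand, then compare the first timing with the third.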