mlx_lm.convert: Hugging Face models to MLX-native

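Why convert

Most LLMs on Hugging Face are stored as safetensors files with PyTorch-style weight names. MLX wants the same weights, but with MLX-style names and (often) with quantization applied. mlx_lm.convert is the tool that reads a Hugging Face model and writes a directory that is ready for mlx_lm.load.

The mlx-community org on Hugging Face is full of pre-converted models, so for popular base models you usually don't need to convert anything yourself. You convert when (1) the model you want isn't in mlx-community, (2) you want a specific quantization that nobody has uploaded, or (3) you fine-tuned a model and want to ship the result.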

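One command

The CLI is a one-liner. It pulls the Hugging Face model, converts it (optionally quantizing along the way), and writes the result to disk. The full command appears in the Code section below.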

Flags worth knowing

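--hf-path (alias --model): Hugging Face repo id (e.g. meta-llama/Llama-3.2-1B-Instruct) or a local directory.
--mlx-path: where to write the MLX-format output. Defaults to a reasonable path under the current directory.
-q / --quantize: quantize the weights during conversion. Without this flag the weights keep their original precision (typically bf16/fp16).
--q-bits: bits per weight (typically 4 or 8). Defaults to 4 when -q is set.
--q-group-size: the granularity of quantization. A smaller group size gives better quality but a larger file. Common values are 32, 64, and 128.
--q-mode: affine (the classic scheme), or one of the newer MX-style formats (mxfp4, nvfp4, mxfp8) meant for hardware-accelerated inference on chips that support them.
--quant-predicate: a mixed-precision recipe (e.g. mixed_3_4 = some layers 3-bit, others 4-bit). Lesson 4 covers this.
--dtype: the dtype to store weights in when they are not quantized (float16, bfloat16, float32).
--upload-repo: if set, pushes the converted model to a Hugging Face repo. Requires huggingface-cli login.
-d / --dequantize: the reverse direction. Turns a quantized MLX model back into full-precision weights (the precision lost during quantization is not recovered).
--trust-remote-code: needed for some models that ship custom modeling code (use with care).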
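To make the --q-bits and --q-group-size trade-off concrete, here is a rough back-of-the-envelope sketch. It assumes affine mode stores one 16-bit scale and one 16-bit bias per group, and that every parameter is quantized. Both are simplifications, and the function names are hypothetical rather than part of mlx-lm. The MX-style modes lay out their metadata differently, so this estimate does not apply to them.

```python
def effective_bits_per_weight(q_bits: int, group_size: int, meta_bits: int = 16) -> float:
    """Estimate storage cost per weight for group-wise affine quantization.

    Assumption: each group of `group_size` weights carries one scale and one
    bias, each `meta_bits` wide, on top of the quantized values themselves.
    """
    return q_bits + 2 * meta_bits / group_size


def estimated_size_gb(n_params: float, q_bits: int, group_size: int) -> float:
    """Rough model size in GB (10^9 bytes) if every parameter were quantized."""
    return n_params * effective_bits_per_weight(q_bits, group_size) / 8 / 1e9


# Compare a few common settings for a hypothetical 1B-parameter model.
for bits, group in [(4, 64), (4, 32), (8, 64)]:
    bpw = effective_bits_per_weight(bits, group)
    print(f"q_bits={bits} group_size={group}: {bpw} bits/weight, "
          f"~{estimated_size_gb(1e9, bits, group):.2f} GB")
```

Under these assumptions, halving the group size from 64 to 32 adds half a bit per weight, which is why smaller groups cost disk space. Real files will deviate because some tensors stay unquantized and the tokenizer files add a little on top.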

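A silent-fail trap to know about

If the source model's config.json is missing fields that mlx-lm needs, or carries values it does not expect (for example an unusual rope-scaling config or a non-standard model_type value), the conversion may finish and then fail later at load time with a confusing key error. The defense is to test-load the converted model before you trust it: python -c "from mlx_lm import load, generate; m, t = load('./my-converted'); print(generate(m, t, prompt='hi', max_tokens=5))". It takes five seconds and can save an hour.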
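A cheap structural pre-check can catch the most obvious problems before the test-load. The sketch below uses only the standard library. The helper name and the specific checks are assumptions about what a loadable directory minimally contains, not mlx-lm's own validation, so treat it as a complement to the test-load rather than a replacement.

```python
import json
from pathlib import Path


def precheck_converted_dir(path) -> list[str]:
    """Return a list of problems found in a converted model directory.

    Hypothetical helper: these checks are illustrative assumptions and do not
    mirror the loader's actual logic.
    """
    root = Path(path)
    config_file = root / "config.json"
    if not config_file.is_file():
        return ["config.json is missing"]
    try:
        config = json.loads(config_file.read_text())
    except json.JSONDecodeError as exc:
        return [f"config.json is not valid JSON: {exc}"]

    problems = []
    if not config.get("model_type"):
        problems.append("config.json has no model_type")
    if not list(root.glob("*.safetensors")):
        problems.append("no .safetensors weight files found")
    return problems
```

An empty list only means the directory looks plausible. Actually loading the model and generating a few tokens is still the real test.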

Code

Convert a small Llama model from HF, quantized to Q4 (a sketch: it works, but skip it in the mlx-quest demo)·bash

# This downloads the original model from meta-llama (requires HF auth + license accept)
# and writes a Q4 MLX-format copy locally. Don't run unless you actually want to convert.
python -m mlx_lm convert \
  --hf-path meta-llama/Llama-3.2-1B-Instruct \
  --mlx-path ./Llama-3.2-1B-Instruct-MLX-Q4 \
  --quantize \
  --q-bits 4 \
  --q-group-size 64

# After conversion, the local directory looks just like the inspected files in lesson 1:
ls ./Llama-3.2-1B-Instruct-MLX-Q4/
# config.json  model.safetensors  tokenizer.json  tokenizer_config.json  ...
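If you want several variants of the same model, it can help to build the command in a script instead of retyping it. The following is a minimal sketch: build_convert_cmd is a hypothetical helper, it only uses the flags listed above, and it prints the commands rather than running them.

```python
import sys


def build_convert_cmd(hf_path: str, mlx_path: str, q_bits=None, q_group_size: int = 64) -> list[str]:
    """Build the argv list for `python -m mlx_lm convert` (hypothetical helper)."""
    cmd = [sys.executable, "-m", "mlx_lm", "convert",
           "--hf-path", hf_path, "--mlx-path", mlx_path]
    if q_bits is not None:
        # Quantization flags are only added when a bit width is requested.
        cmd += ["--quantize", "--q-bits", str(q_bits),
                "--q-group-size", str(q_group_size)]
    return cmd


source = "meta-llama/Llama-3.2-1B-Instruct"
for bits in (4, 8):
    cmd = build_convert_cmd(source, f"./Llama-3.2-1B-Instruct-MLX-Q{bits}", q_bits=bits)
    # Inspect the command first; to execute it, pass the list to
    # subprocess.run(cmd, check=True).
    print(" ".join(cmd))
```

Passing the arguments as a list avoids shell-quoting problems if a path contains spaces.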

Quick smoke test: does it actually load and generate?·python

# Always test-load before trusting a freshly converted model.
from mlx_lm import load, generate

model, tok = load("./Llama-3.2-1B-Instruct-MLX-Q4")
print(generate(model, tok, prompt="Hi.", max_tokens=10, verbose=False))

# If this raises a key error, the conversion's config.json is missing something
# the loader expects — diff against a known-working mlx-community config.
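The "diff against a known-working config" step can be sketched with the standard library alone. The function names here are hypothetical, and the comparison only looks at top-level keys, which is usually enough to spot a missing field.

```python
import json
from pathlib import Path


def load_config(model_dir) -> dict:
    """Read config.json from a model directory."""
    return json.loads((Path(model_dir) / "config.json").read_text())


def diff_config_keys(converted: dict, reference: dict) -> dict:
    """Compare two config dicts by top-level key and value."""
    shared = set(converted) & set(reference)
    return {
        "missing": sorted(set(reference) - set(converted)),  # in reference only
        "extra": sorted(set(converted) - set(reference)),    # in converted only
        "changed": sorted(k for k in shared if converted[k] != reference[k]),
    }


# Illustrative configs; in practice use load_config() on the two directories.
converted = {"model_type": "llama", "hidden_size": 2048}
reference = {"model_type": "llama", "hidden_size": 2048, "rope_scaling": {"factor": 32.0}}
print(diff_config_keys(converted, reference))
```

Entries under "missing" are the first place to look when the loader raises a key error. Entries under "changed" are often harmless (for example different quantization settings), so read them with that in mind.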

Optional: push the converted model to your own HF repo·bash

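# Requires `huggingface-cli login` first.
# If the --mlx-path directory already exists from an earlier run, convert may
# refuse to overwrite it; remove it or choose a different path.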
python -m mlx_lm convert \
  --hf-path meta-llama/Llama-3.2-1B-Instruct \
  --mlx-path ./Llama-3.2-1B-Instruct-MLX-Q4 \
  --quantize --q-bits 4 \
  --upload-repo your-username/Llama-3.2-1B-Instruct-MLX-Q4

# The upload includes a generated README with the source repo,
# the conversion command, and the quantization config — so others
# can reproduce or audit the conversion.

Exercise

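Pick a small instruct model that isn't in mlx-community (or any model you would like to have in a custom quantization). Convert it locally with mlx_lm.convert to both Q4 and Q8, producing two output directories. Compare their sizes on disk, then run a quick generate against each with the same prompt. Note the size ratio (Q8 is roughly 2× the disk size of Q4) and whether the outputs differ meaningfully in quality. Write up what you found in two sentences.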
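For the size comparison, a small standard-library sketch like the one below is enough. The directory names are placeholders for whatever you passed as --mlx-path, and the helper names are hypothetical.

```python
from pathlib import Path


def dir_size_bytes(path) -> int:
    """Total size of all regular files under `path`, including subdirectories."""
    return sum(f.stat().st_size for f in Path(path).rglob("*") if f.is_file())


def size_ratio(larger_dir, smaller_dir) -> float:
    """How many times bigger `larger_dir` is than `smaller_dir` on disk."""
    smaller = dir_size_bytes(smaller_dir)
    if smaller == 0:
        raise ValueError(f"{smaller_dir} contains no files")
    return dir_size_bytes(larger_dir) / smaller


# Placeholder paths: replace with your two conversion outputs.
q4_dir, q8_dir = Path("./my-model-Q4"), Path("./my-model-Q8")
if q4_dir.is_dir() and q8_dir.is_dir():
    print(f"Q4: {dir_size_bytes(q4_dir) / 1e9:.2f} GB")
    print(f"Q8: {dir_size_bytes(q8_dir) / 1e9:.2f} GB")
    print(f"Q8/Q4 ratio: {size_ratio(q8_dir, q4_dir):.2f}")
```

Expect the measured ratio to land a little under 2×, since per-group quantization metadata and the tokenizer files add roughly the same amount to both directories.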