MLX model format — safetensors + config + tokenizer

What lives in an MLX model directory

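Every MLX-format model on Hugging Face (or on disk after mlx_lm.convert) is a small set of files in a directory. There are no opaque blobs and no proprietary containers: every file is in a readable format that standard tools can inspect.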

model.safetensors (or shards such as model-00001-of-00003.safetensors): the weights in safetensors format. Memory-mappable, keyed by unique tensor name, fast to load.
model.safetensors.index.json: only needed when the weights are sharded; maps each tensor name to the shard file that holds it.
config.json: the architecture descriptor. It holds model_type, hidden size, layer count, attention head count, vocab size, and the quantization config (if the model is quantized). mlx-lm reads it to dispatch to the right model class.
tokenizer.json + tokenizer_config.json: the tokenizer, its special tokens, and (importantly) the chat_template as a Jinja string. Lesson 5 of the lm track depends on this.
Optional: generation_config.json with default sampling parameters, special_tokens_map.json, added_tokens.json, and so on. mlx-lm reads what it needs and ignores the rest.
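
The file list above can be turned into a quick inventory check. The following is a minimal sketch, not part of mlx-lm: the helper name inventory and the split into "core" and "optional" files are assumptions made for illustration, based only on the file names listed above.

```python
from pathlib import Path

# File names come from the list above. Treating the first group as "core"
# and the second as "optional" is an assumption of this sketch.
CORE_FILES = ["config.json", "tokenizer.json", "tokenizer_config.json"]
OPTIONAL_FILES = ["generation_config.json", "special_tokens_map.json", "added_tokens.json"]

def inventory(model_dir):
    """Classify the files found in a model directory (hypothetical helper)."""
    root = Path(model_dir)
    # is_file() follows symlinks, which matters in the Hugging Face cache
    names = {p.name for p in root.iterdir() if p.is_file()}
    weights = sorted(n for n in names if n.endswith(".safetensors"))
    return {
        "weights": weights,
        "has_index": "model.safetensors.index.json" in names,
        "missing_core": [n for n in CORE_FILES if n not in names],
        "optional_present": [n for n in OPTIONAL_FILES if n in names],
    }
```

Calling inventory(path_to_snapshot) on a cached snapshot directory returns a small dict you can print before attempting a load.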

Why this matters

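Three reasons. First, you can verify what you have. cat config.json shows the architecture and quantization at a glance, with no guessing from the model card. Second, you can reuse the pieces. The same tokenizer files work across model variants in the same family, so you can swap weights without re-bundling tokenization. Third, you can debug load failures. A "weight key not found" error usually means the safetensors file and the config disagree: open both and look.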
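
For the third point, it helps to see which tensor names a weight file actually contains. A safetensors file starts with an 8-byte little-endian header length followed by a JSON header, so the names can be read with the standard library alone. This is a rough sketch: read_safetensors_header and missing_keys are hypothetical helper names, and the list of expected names is something you supply yourself (for example, copied from the error message).

```python
import json
import struct

def read_safetensors_header(path):
    """Return the JSON header of a .safetensors file without reading tensor data."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))  # little-endian unsigned 64-bit
        header = json.loads(f.read(header_len))
    header.pop("__metadata__", None)  # optional free-form metadata entry
    return header  # tensor name -> {"dtype", "shape", "data_offsets"}

def missing_keys(path, expected_names):
    """Names you expected that the file does not contain (hypothetical helper)."""
    present = set(read_safetensors_header(path))
    return sorted(set(expected_names) - present)
```

If missing_keys returns a non-empty list, compare those names against model_type in config.json; a mismatch there is a common reason the loader looks for tensors the file never had.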

Sharded vs single-file

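Small models (under ~5 GB on disk) ship as a single model.safetensors. Larger models are split into shards, with model.safetensors.index.json as the routing map. mlx-lm handles both transparently: you don't pick a shard, the loader does. The shard size limit is set during conversion (typically 5 GB per shard).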
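
To see the routing map for yourself, you can group tensor names by the shard that holds them. The sketch below assumes the usual index layout, where the JSON file has a top-level "weight_map" object mapping tensor name to shard file name; load_weight_map and group_by_shard are hypothetical helper names, not mlx-lm functions.

```python
import json
from collections import defaultdict
from pathlib import Path

def load_weight_map(model_dir):
    """Read the tensor-name -> shard-file mapping from the index file."""
    index_path = Path(model_dir) / "model.safetensors.index.json"
    with open(index_path) as f:
        return json.load(f)["weight_map"]

def group_by_shard(weight_map):
    """Invert the mapping: shard file -> sorted list of tensor names."""
    shards = defaultdict(list)
    for name, shard in weight_map.items():
        shards[shard].append(name)
    return {shard: sorted(names) for shard, names in sorted(shards.items())}
```

Printing the length of each list in group_by_shard(load_weight_map(model_dir)) gives a quick picture of how the tensors are distributed across shards.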

What to do with this knowledge

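For the rest of this track, every lesson assumes you can ls a model directory and recognize what is in it. Get comfortable opening config.json for any model you are about to load: in 30 seconds it answers "which architecture, which dtype, which quantization, which context length".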

Code

Inspecting a cached MLX model directory·bash

# Find the cached model
SNAP=~/.cache/huggingface/hub/models--mlx-community--Llama-3.2-1B-Instruct-4bit/snapshots/
ls "$SNAP"*/

# Sample listing (verified 2026-05-03):
#   config.json
#   model.safetensors
#   model.safetensors.index.json   (only if sharded; 1B Q4 fits in single file)
#   special_tokens_map.json
#   tokenizer.json
#   tokenizer_config.json

Reading config.json: architecture + quantization at a glance·python

import json, os, glob

snap = glob.glob(os.path.expanduser(
    "~/.cache/huggingface/hub/models--mlx-community--Llama-3.2-1B-Instruct-4bit/snapshots/*/"
))[0]

with open(os.path.join(snap, "config.json")) as f:
    cfg = json.load(f)

print("model_type      :", cfg.get("model_type"))
print("hidden_size     :", cfg.get("hidden_size"))
print("num_hidden_layers:", cfg.get("num_hidden_layers"))
print("num_attention_heads:", cfg.get("num_attention_heads"))
print("vocab_size      :", cfg.get("vocab_size"))
print("max_position_embeddings:", cfg.get("max_position_embeddings"))
print("quantization    :", cfg.get("quantization"))   # group_size, bits, mode

Inspecting safetensors metadata without loading weights·python

import os, glob
from safetensors import safe_open

snap = glob.glob(os.path.expanduser(
    "~/.cache/huggingface/hub/models--mlx-community--Llama-3.2-1B-Instruct-4bit/snapshots/*/"
))[0]
shard = os.path.join(snap, "model.safetensors")

with safe_open(shard, framework="numpy") as f:
    keys = list(f.keys())
    print(f"Total tensors: {len(keys)}")
    print("First 8 tensor names:")
    for k in keys[:8]:
        s = f.get_slice(k)  # lazy slice: exposes dtype and shape without loading the tensor
        print(f"  {k:60} dtype={s.get_dtype()} shape={s.get_shape()}")

Exercise

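Inspect a cached MLX model on your machine. Read its config.json and write down: (1) the architecture name, (2) the quantization config (group_size, bits, mode), (3) the max context length. Then open tokenizer_config.json, find the chat_template field, and copy its first 200 characters. The point of the exercise is to retire the "what's in here?" anxiety: a 30-second inspection answers it for any MLX model.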