Adapter merge: inference without going through LoRA every time

Why merge

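After training you have two separate things: the base model and the trained adapter. At inference time the model computes (W + A · B) · x, the base weight plus the adapter's contribution, on every layer the adapter touches. That extra matrix multiplication, per touched layer and per token, is a cost you pay on every generation.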
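To put a rough number on that overhead, here is a minimal sketch that counts multiply-adds for one token through one adapted linear layer. It assumes the adapter is applied as W · x + A · (B · x) without ever forming A · B, and the layer sizes (a 4096 x 4096 projection with a rank-8 adapter) are hypothetical, not taken from any particular model.

```python
def per_token_multiply_adds(d_in: int, d_out: int, rank: int) -> dict:
    """Count multiply-adds for one token through one adapted linear layer."""
    base = d_in * d_out                    # W · x
    adapter = rank * d_in + rank * d_out   # B · x, then A · (B · x)
    return {"base": base, "adapter": adapter, "overhead": adapter / base}


# Hypothetical sizes: a 4096 x 4096 projection with a rank-8 adapter.
costs = per_token_multiply_adds(d_in=4096, d_out=4096, rank=8)
print(costs)
```

Under these assumptions the adapter adds well under 1% more arithmetic per layer, which is why the speedup from merging is usually modest rather than dramatic. The extra work is still repeated on every adapted layer for every token.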

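The fix is to merge: fold the adapter into the base weights once, producing a new model file whose W already contains the trained A · B. After the merge, inference is a normal forward pass with no adapter overhead. Quality is the same as the unmerged setup, up to small numerical differences; the cost per token is lower.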
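The fold can be checked on a toy example. The sketch below uses plain Python lists and follows the notation above, assuming A has shape d_out x rank and B has shape rank x d_in. Real LoRA implementations also multiply the update by a scaling factor, which is omitted here for brevity.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector (list)."""
    return [sum(x * y for x, y in zip(row, v)) for row in m]


# Toy sizes: d_out = 2, d_in = 3, rank = 1.
W = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0]]
A = [[0.5], [-1.0]]        # d_out x rank
B = [[1.0, 0.0, 2.0]]      # rank x d_in
x = [1.0, 1.0, 1.0]

# Unmerged: base output plus the adapter's contribution, computed separately.
unmerged = [w + d for w, d in zip(matvec(W, x), matvec(A, matvec(B, x)))]

# Merged: fold A · B into W once, then run a single ordinary matvec.
delta = matmul(A, B)
W_merged = [[w + d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(W, delta)]
merged = matvec(W_merged, x)

print(unmerged, merged)
```

Both paths give the same output, and W_merged has exactly the shape of W, so the merged layer carries no extra parameters.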

The merge command: mlx_lm.fuse

One command. Point it at the base model and the adapter directory; you get back a new model directory that is ready for mlx_lm.load.

What you give up by merging

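A merged model is static: the adapter is baked in and cannot be swapped out. If your workflow involves several adapters for different tasks (one for SQL, one for haiku, one for summarization), keeping them separate and loading them on demand is more flexible than fusing. Fuse only once you have decided that the model plus this specific adapter is your deployment artifact.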
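If you keep adapters separate, loading one on demand might look like the sketch below. The task names and adapter directories are hypothetical placeholders. The only mlx_lm call used is load with its adapter_path argument, imported lazily so the mapping can be inspected on a machine without MLX.

```python
# Hypothetical task -> adapter directory mapping; the paths are placeholders.
ADAPTERS = {
    "sql": "./adapters/sql",
    "haiku": "./adapters/haiku",
    "summary": "./adapters/summary",
}
BASE_MODEL = "mlx-community/Mistral-7B-Instruct-v0.3-4bit"


def adapter_for(task: str) -> str:
    """Return the adapter directory registered for a task."""
    try:
        return ADAPTERS[task]
    except KeyError:
        raise ValueError(f"no adapter registered for task {task!r}") from None


def load_for_task(task: str):
    """Load the base model with the adapter for `task` applied (unfused)."""
    from mlx_lm import load  # lazy import: only needed when actually loading
    return load(BASE_MODEL, adapter_path=adapter_for(task))
```

Note that this simple version reloads the base weights on every call. It trades efficiency for clarity, and a longer-running service would likely want something smarter.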

The fused model costs roughly the same disk space as the base model. Fusing adds no new parameters; it only folds the adapter's contribution into the existing weights.
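One way to confirm the disk-cost claim on your own machine is to sum the file sizes in each directory. This is a small standard-library sketch; the directory names ./my-fused and ./my-adapter are the placeholders used in the Code section, and the location of the base model on disk is left for you to fill in.

```python
from pathlib import Path


def dir_size_bytes(path) -> int:
    """Total size in bytes of all regular files under `path`."""
    return sum(p.stat().st_size for p in Path(path).rglob("*") if p.is_file())


def human(n: int) -> str:
    """Format a byte count in decimal gigabytes."""
    return f"{n / 1e9:.2f} GB"


if __name__ == "__main__":
    # Placeholder directories; add the path of your local base model to compare.
    for d in ("./my-fused", "./my-adapter"):
        if Path(d).exists():
            print(d, human(dir_size_bytes(d)))
```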

Optional: GGUF export

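With --export-gguf, mlx_lm.fuse can also export the fused model in GGUF format. This is the cleanest route from "fine-tuned in MLX" to "running in llama.cpp/Ollama on a non-Apple machine": fuse first, then export. It is not a common path, but it is useful when you have to ship the result to a non-Mac deployment target. GGUF export has historically been limited to Llama, Mistral, and Mixtral style models in fp16, so check the mlx_lm documentation for your version before relying on it.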

Optional: Hugging Face upload

--upload-repo your-username/repo-name pushes the merged model to a Hugging Face repo, just like the upload flag of mlx_lm.convert (see lesson 2 on convert). Use it when you want other people to be able to mlx_lm.load your fine-tune by repo id.

Code

Fuse an adapter into the base to produce a deployable model·bash

# Folds ./my-adapter into the base model, writes the result to ./my-fused.
python -m mlx_lm fuse \
  --model mlx-community/Mistral-7B-Instruct-v0.3-4bit \
  --adapter-path ./my-adapter \
  --save-path ./my-fused

# After this, ./my-fused looks like a normal MLX model directory:
#   config.json  model.safetensors  tokenizer.json  ...
# And you can load it without referencing the adapter at all:
python -c "from mlx_lm import load, generate; m, t = load('./my-fused'); print(generate(m, t, prompt='hi', max_tokens=20))"

Fuse + push to your own Hugging Face repo (one command)·bash

# Requires `huggingface-cli login` first.
python -m mlx_lm fuse \
  --model mlx-community/Mistral-7B-Instruct-v0.3-4bit \
  --adapter-path ./my-adapter \
  --save-path ./my-fused \
  --upload-repo your-username/Mistral-7B-Instruct-v0.3-MyDomain-4bit

# Others can then pull your fine-tune by repo id:
#   model, tok = load("your-username/Mistral-7B-Instruct-v0.3-MyDomain-4bit")

Fuse + export to GGUF (for non-Apple deployment)·bash

# Less common path: cross-format export so the result can run in llama.cpp / Ollama
# on Linux/Windows machines.
python -m mlx_lm fuse \
  --model mlx-community/Mistral-7B-Instruct-v0.3-4bit \
  --adapter-path ./my-adapter \
  --save-path ./my-fused \
  --export-gguf \
  --gguf-path ./my-fused.gguf

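# The resulting .gguf is a single file you can copy to any GGUF-capable runtime.
# In some mlx_lm versions --gguf-path is resolved relative to --save-path, so if
# the file is not at ./my-fused.gguf, look inside ./my-fused/.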

Exercise

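Take the adapter you trained earlier in this track and fuse it into a deployable model with mlx_lm.fuse. Measure generation time for (a) the base model with the adapter loaded separately at inference and (b) the fused model. The fused model should be slightly faster per token, because it skips the adapter math on every layer. Write two sentences on the latency difference and on whether, for your use case, it is worth losing the ability to swap adapters.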
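A possible starting point for the measurement is sketched below. mean_seconds is a hypothetical helper that uses only the standard library. compare assumes mlx_lm's load (with adapter_path) and generate, and is not called here because it needs your own model and adapter paths.

```python
import time


def mean_seconds(fn, runs: int = 3, warmup: int = 1) -> float:
    """Average wall-clock seconds per call of `fn`, after `warmup` untimed calls."""
    for _ in range(warmup):
        fn()
    start = time.perf_counter()
    for _ in range(runs):
        fn()
    return (time.perf_counter() - start) / runs


def compare(base_model, adapter_path, fused_path, prompt="hi", max_tokens=100):
    """Return (unfused_seconds, fused_seconds) for the same prompt and length."""
    from mlx_lm import load, generate  # lazy import: requires MLX to be installed

    m_a, t_a = load(base_model, adapter_path=adapter_path)
    m_f, t_f = load(fused_path)
    unfused = mean_seconds(
        lambda: generate(m_a, t_a, prompt=prompt, max_tokens=max_tokens))
    fused = mean_seconds(
        lambda: generate(m_f, t_f, prompt=prompt, max_tokens=max_tokens))
    return unfused, fused
```

For example, compare("mlx-community/Mistral-7B-Instruct-v0.3-4bit", "./my-adapter", "./my-fused") would time both setups with the placeholder paths from the Code section. Use the same prompt and max_tokens for both so the comparison is fair, and expect some run-to-run noise.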