End-to-End walkthrough — Fine-tune, merge, serve

The whole pipeline in one place

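The previous five lessons gave you the pieces; this lesson puts them together. You fine-tune a 7B instruct base on a small domain dataset, merge the adapter into a deployable model, serve it through mlx-lm's HTTP server, and call it from openai-python. Everything runs end to end on a single 32 GB Mac, with no need to rent a GPU.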

The full sequence

Pick a base. mlx-community/Mistral-7B-Instruct-v0.3-4bit: a 7B model at 4-bit quantization that fits in about 5 GB of unified memory and gives a decent instruction-following baseline.
Prepare the data. Build a small train.jsonl and valid.jsonl in chat format (lesson 2). 100-300 examples are enough for a demo.
Train. Run mlx_lm lora --train with the sane defaults from lesson 3. Watch the validation loss and stop when it plateaus.
Merge. Run mlx_lm fuse to fold the adapter into the base, producing a new model directory.
Serve. Start mlx_lm server --model ./my-fused in one terminal.
Use. From another terminal, hit the server from openai-python. The fine-tuned model now answers in your domain's voice.
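
The stopping rule in the training step ("stop when it plateaus") can be made concrete. This is a minimal sketch, assuming you have copied the validation losses reported at each evaluation into a list. The function name, `patience`, and `min_delta` are illustrative choices and not part of mlx-lm.

```python
def has_plateaued(val_losses, patience=3, min_delta=0.01):
    """Return True if the last `patience` evals did not beat the best
    earlier loss by at least `min_delta`. Hypothetical helper."""
    if len(val_losses) <= patience:
        return False  # not enough history to judge
    best_before = min(val_losses[:-patience])
    recent_best = min(val_losses[-patience:])
    return recent_best > best_before - min_delta


# Illustrative numbers only, not real training output.
still_improving = [2.10, 1.80, 1.55, 1.40, 1.28]
flat = [2.10, 1.60, 1.42, 1.415, 1.42, 1.418]
print(has_plateaued(still_improving), has_plateaued(flat))
```

A larger `patience` tolerates noisier loss curves at the cost of a few extra evaluations before you stop.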

What this exercise leaves you with

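Three things. First, a working artifact you can refer back to: the file layout, the YAML config, the training command, the fuse command, and the server command. Second, the muscle memory that this entire pipeline is one person on one Mac in an afternoon, not a team-and-cloud project. Third, a baseline to iterate from: your next fine-tune is a variation on this walkthrough, not a rebuild from scratch.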

Artifacts the walkthrough produces

./my-data/{train,valid}.jsonl: the curated dataset.
./my-lora.yaml: the training config (the single source of truth).
./my-adapter/: the trained LoRA adapter (small files holding the actual learned correction).
./my-fused/: the merged, deployable model directory.
An HTTP endpoint at http://localhost:8080/v1 that serves the fine-tune to any OpenAI-compatible client.
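
To confirm that a run actually left these files behind, a small check like the following may help. It is a sketch that only tests for the on-disk paths named in the list. The helper name is hypothetical, and it says nothing about whether the contents are valid.

```python
from pathlib import Path

# Paths taken from the artifact list above.
EXPECTED = [
    "my-data/train.jsonl",
    "my-data/valid.jsonl",
    "my-lora.yaml",
    "my-adapter",
    "my-fused",
]


def missing_artifacts(root="."):
    """Return the expected paths that do not exist under `root`."""
    base = Path(root)
    return [p for p in EXPECTED if not (base / p).exists()]
```

Calling `missing_artifacts()` from the project directory returns an empty list once every step has completed.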

Where to go from here

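If you want others to be able to pull it, push the fused model to your own mlx-community-style HF repo with --upload-repo. Or keep iterating: train a second adapter on a different slice of the domain and compare the two at inference time. Or, when the built-in server hits its limits, jump to prod.lesson1 to wrap the fine-tune in a real FastAPI service with worker management.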

Code

Step 1-2: Pick a base, prepare a small dataset·bash

mkdir -p my-data
# Hand-write at least 50 chat-format examples in train.jsonl
# and ~10 in valid.jsonl (lesson 2 has the format).
# You can also use a public dataset like mlx-community/wikisql for the demo.
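
If you would rather generate the files than write JSON by hand, the data step can be sketched in Python. This assumes lesson 2's chat format is the usual one-object-per-line layout with a `messages` list of `role`/`content` turns. Both helper names are hypothetical.

```python
import json
from pathlib import Path


def write_chat_jsonl(path, pairs, system=None):
    """Write (question, answer) pairs as one chat-format JSON object per line."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as f:
        for question, answer in pairs:
            messages = []
            if system:
                messages.append({"role": "system", "content": system})
            messages.append({"role": "user", "content": question})
            messages.append({"role": "assistant", "content": answer})
            f.write(json.dumps({"messages": messages}, ensure_ascii=False) + "\n")


def check_chat_jsonl(path):
    """Return the number of records; raise ValueError on a malformed one."""
    count = 0
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            roles = [m["role"] for m in json.loads(line)["messages"]]
            if not roles or roles[-1] != "assistant":
                raise ValueError(f"{path}:{lineno} must end with an assistant turn")
            count += 1
    return count
```

A typical call would be `write_chat_jsonl("my-data/train.jsonl", pairs)` followed by `check_chat_jsonl("my-data/train.jsonl")`, where `pairs` is your own list of question and answer strings.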

Step 3: Train the LoRA adapter·bash

# my-lora.yaml from lesson 3 contains all the flags as one source of truth.
python -m mlx_lm lora -c my-lora.yaml

# Or inline:
python -m mlx_lm lora \
  --model mlx-community/Mistral-7B-Instruct-v0.3-4bit \
  --train --data ./my-data \
  --fine-tune-type lora --num-layers 16 \
  --batch-size 4 --iters 200 --learning-rate 5e-5 \
  --max-seq-length 2048 \
  --steps-per-eval 25 --save-every 100 \
  --adapter-path ./my-adapter
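
For reference, here is one way my-lora.yaml could mirror the inline flags. This is a sketch under an assumption: the key names are taken to be the CLI flag names with dashes replaced by underscores. Lesson 3's actual file may differ, so treat that file as authoritative.

```python
from pathlib import Path

# Values copied from the inline flags in this step. The key names are an
# assumption: each CLI flag with dashes replaced by underscores.
CONFIG_YAML = """\
model: "mlx-community/Mistral-7B-Instruct-v0.3-4bit"
train: true
data: "./my-data"
fine_tune_type: lora
num_layers: 16
batch_size: 4
iters: 200
learning_rate: 5.0e-5
max_seq_length: 2048
steps_per_eval: 25
save_every: 100
adapter_path: "./my-adapter"
"""


def write_config(path="my-lora.yaml"):
    """Write the config text to `path` and return the path."""
    target = Path(path)
    target.write_text(CONFIG_YAML, encoding="utf-8")
    return target
```

The learning rate is written as `5.0e-5` rather than `5e-5` because some YAML parsers read the dot-less form as a string.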

Step 4: Fuse the adapter; Step 5: Serve·bash

# Fuse
python -m mlx_lm fuse \
  --model mlx-community/Mistral-7B-Instruct-v0.3-4bit \
  --adapter-path ./my-adapter \
  --save-path ./my-fused

# Serve (in one terminal — keeps running)
python -m mlx_lm server --model ./my-fused
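
The server needs a moment to load the model before it accepts connections. A client script can wait for it with a sketch like the one below, which polls the TCP port from the endpoint used in this walkthrough (8080). The helper name is hypothetical. An open port only shows that the process is listening, not that the model answers well.

```python
import socket
import time


def wait_for_port(host="localhost", port=8080, timeout=60.0, interval=0.5):
    """Poll until a TCP connection to host:port succeeds or `timeout` elapses."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            time.sleep(interval)  # not listening yet; try again shortly
    return False
```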

Step 6: Hit the fine-tune from a client·python

# In a separate terminal:
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="x")

resp = client.chat.completions.create(
    model="./my-fused",
    messages=[
        {"role": "system", "content": "You are an expert in <your domain>."},
        {"role": "user",   "content": "<a question your fine-tune should answer well>"},
    ],
    max_tokens=120,
    temperature=0.7,
)
print(resp.choices[0].message.content)
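
For the test loop, a single prompt is rarely enough. The sketch below runs a list of questions through the same chat-completions call. It takes the client as a parameter, so it works with the `OpenAI(...)` client from this step or with a stub. The function name and the choice of `temperature=0.0` are illustrative assumptions.

```python
def ask_all(client, questions, model="./my-fused", system=None, max_tokens=120):
    """Send each question as its own chat completion and collect the replies."""
    answers = []
    for question in questions:
        messages = []
        if system:
            messages.append({"role": "system", "content": system})
        messages.append({"role": "user", "content": question})
        resp = client.chat.completions.create(
            model=model,
            messages=messages,
            max_tokens=max_tokens,
            temperature=0.0,  # low temperature keeps repeated runs comparable
        )
        answers.append(resp.choices[0].message.content)
    return answers


# Usage against the running server (not executed here):
#   from openai import OpenAI
#   client = OpenAI(base_url="http://localhost:8080/v1", api_key="x")
#   for q, a in zip(questions, ask_all(client, questions)):
#       print(q, "->", a)
```

Keeping a fixed question list around makes it easier to compare one fine-tune against the next.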

Exercise

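Run the full walkthrough end-to-end on a domain you care about. Spend 70% of your time on the data (lesson 2), 20% on running training and watching the loss, and 10% on the fuse + serve + test loop. Save the YAML config so you can rerun the whole thing later with one command. The point of the exercise is to feel that the pipeline is repeatable: your second domain fine-tune should take half the time of the first.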
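
One way to get the "one command" rerun is a small driver script. This is a sketch that chains the train and fuse commands from the Code section. The function names are hypothetical, and serving is left out because it keeps running in its own terminal.

```python
import subprocess
import sys

BASE = "mlx-community/Mistral-7B-Instruct-v0.3-4bit"


def pipeline_commands(config="my-lora.yaml", adapter="./my-adapter", fused="./my-fused"):
    """Return the train and fuse commands from this walkthrough as argv lists."""
    py = [sys.executable, "-m", "mlx_lm"]
    return [
        py + ["lora", "-c", config],
        py + ["fuse", "--model", BASE, "--adapter-path", adapter, "--save-path", fused],
    ]


def run_pipeline(dry_run=True):
    """Print each command; also execute it unless dry_run is set."""
    for cmd in pipeline_commands():
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)  # stop at the first failing step
```

Calling `run_pipeline()` prints the commands without running them, and `run_pipeline(dry_run=False)` executes them in order.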