Wrapping MLX behind a FastAPI service

Why graduate from the built-in server

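mlx-lm's built-in mlx_lm.server (covered in lm.lesson6) is fine for development, demos, and any single-user, low-concurrency use. The moment you need real concurrent request handling, request queueing, authentication, custom logging, or any operational concern beyond "a model that answers OpenAI-shaped requests", you are better off wrapping mlx-lm in a FastAPI service you control.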

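This is not a complicated wrapper: roughly 100 lines of FastAPI code that load the model once at startup and expose /generate and /health endpoints. The win is that you own every operational decision and can swap, fork, or extend any of them without fighting the built-in server's defaults.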

The pattern

Load the model once at startup: FastAPI's lifespan context manager handles this. The (model, tokenizer) pair lives in app state and is reused across every request.
Warm up: run one throwaway forward pass during startup. The first inference call after a fresh load pays MLX's JIT compilation cost; warming up means the first real request does not pay that tax.
Expose /generate: it accepts a JSON body with the prompt and sampling parameters and returns the generated text. For token-by-token delivery, add streaming over Server-Sent Events.
Expose /health: a cheap endpoint that a load balancer or process supervisor can hit to verify the service is alive.
Run with uvicorn: a single process and a single worker for MLX. Concurrency comes from async I/O, not from a worker pool, because the GPU is the bottleneck and it cannot be parallelized efficiently across workers.
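The Server-Sent Events streaming mentioned for /generate can be sketched without any MLX dependency. The helper below is a minimal illustration, not the lesson's implementation: `sse_events` and the `[DONE]` sentinel are hypothetical names chosen here, and the input is any iterable of text chunks.

```python
import json
from typing import Iterable, Iterator


def sse_events(chunks: Iterable[str]) -> Iterator[str]:
    """Wrap text chunks as Server-Sent Events frames (hypothetical helper)."""
    for chunk in chunks:
        # JSON-encode the chunk so a newline inside it cannot break SSE framing.
        payload = json.dumps({"text": chunk})
        yield f"data: {payload}\n\n"
    # "[DONE]" is an assumed end-of-stream marker, borrowed from OpenAI-style APIs.
    yield "data: [DONE]\n\n"


# Stand-in chunks; in the real service these would come from the model.
frames = list(sse_events(["Par", "is", "\n"]))
for frame in frames:
    print(repr(frame))
```

In the service, a generator like this could be passed to FastAPI's StreamingResponse with media_type="text/event-stream", with the chunks coming from mlx-lm's stream_generate. The exact shape of what stream_generate yields has changed between mlx-lm releases, so check the installed version before relying on it.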

What not to do

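Do not run multiple worker processes against the same model: each one loads its own copy and they fight over GPU memory. Do not try to parallelize generation across requests at the framework level: MLX serializes GPU access anyway, and Python's GIL plus async I/O is enough orchestration. Do not add a request queue at the FastAPI level for any normal traffic: uvicorn's connection handling is already the queue.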
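To see why no extra queue is needed, the sketch below uses only the standard library. `fake_generate` is a hypothetical stand-in for a blocking generation call, and `handler` mimics an `async def` endpoint that calls it directly. Two "requests" submitted together still run strictly one after the other.

```python
import asyncio
import time


def fake_generate(prompt: str, delay: float = 0.05) -> str:
    """Hypothetical stand-in for a blocking generation call."""
    time.sleep(delay)  # blocks the calling thread, like a synchronous forward pass
    return f"echo: {prompt}"


async def handler(prompt: str, order: list[str]) -> str:
    # Same shape as an async endpoint that calls blocking code directly.
    order.append(f"start {prompt}")
    text = fake_generate(prompt)
    order.append(f"end {prompt}")
    return text


async def run_two() -> tuple[list[str], float]:
    order: list[str] = []
    t0 = time.perf_counter()
    # Submit both "requests" at once; the blocking call keeps them from interleaving.
    await asyncio.gather(handler("a", order), handler("b", order))
    return order, time.perf_counter() - t0


order, elapsed = asyncio.run(run_two())
print(order)
print(f"{elapsed:.3f}s")
```

The second handler does not start until the first finishes, and total time is about the sum of the two delays. One consequence worth knowing: while a blocking call runs inside an `async def` handler, the event loop cannot serve anything else, including a health check.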

A minimal FastAPI service

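The code block below is the entire service in one file. Save it as app.py, run it with uvicorn app:app --host 0.0.0.0 --port 8000, and hit POST /generate with a JSON body. Production patterns (rate limiting, auth, structured logging) are deliberately absent; add them when you need them, but the skeleton is here.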

Code

Minimal FastAPI service for MLX (app.py)·python

# Save as app.py; run with: uvicorn app:app --host 0.0.0.0 --port 8000

from contextlib import asynccontextmanager
from fastapi import FastAPI
from pydantic import BaseModel
from mlx_lm import load, generate

# Module-level cache for the loaded (model, tokenizer) pair
state: dict = {}


@asynccontextmanager
async def lifespan(app: FastAPI):
    # Startup: load the model once
    print("Loading model...")
    state["model"], state["tokenizer"] = load("mlx-community/Llama-3.2-1B-Instruct-4bit")

    # Warm up — pay the JIT cost during startup, not during the first real request
    _ = generate(state["model"], state["tokenizer"], prompt="warmup", max_tokens=1)
    print("Model loaded and warm.")

    yield   # App is alive here

    # Shutdown: nothing special needed; MLX cleans up at process exit
    state.clear()


app = FastAPI(lifespan=lifespan)


class GenerateRequest(BaseModel):
    prompt: str
    max_tokens: int = 100


@app.get("/health")
async def health():
    return {"status": "ok", "model_loaded": "model" in state}


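@app.post("/generate")
async def gen(req: GenerateRequest):
    # generate() is a blocking call, so the event loop is busy until it returns
    # and requests are handled one at a time.
    text = generate(
        state["model"], state["tokenizer"],
        prompt=req.prompt, max_tokens=req.max_tokens, verbose=False,
    )
    return {"text": text}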

Hit it with curl·bash

# Health check
curl http://localhost:8000/health
# {"status":"ok","model_loaded":true}

# Generate
curl -X POST http://localhost:8000/generate \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Capital of France?", "max_tokens": 20}'
# {"text":"Paris is the capital of France..."}

Exercise

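Save the FastAPI service from the code block as app.py. In one terminal, run uvicorn app:app --host 0.0.0.0 --port 8000; in another, hit /health and /generate with curl. Time the first /generate call against the second: thanks to the warmup step they should be nearly identical. Write two sentences: what you noticed and which feature you would add next.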
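If you prefer to time the calls from Python rather than curl, a small standard-library client is enough. This is a sketch under stated assumptions: the helper names (`timed`, `post_generate`, `compare_first_two`) are hypothetical, and the URL assumes the service is running locally on port 8000 as in the exercise.

```python
import json
import time
import urllib.request
from typing import Any, Callable


def timed(fn: Callable[..., Any], *args: Any, **kwargs: Any) -> tuple[Any, float]:
    """Call fn and return (result, elapsed seconds)."""
    t0 = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - t0


def post_generate(prompt: str, max_tokens: int = 20,
                  base_url: str = "http://localhost:8000") -> str:
    """POST to /generate and return the generated text (assumes a local server)."""
    body = json.dumps({"prompt": prompt, "max_tokens": max_tokens}).encode("utf-8")
    req = urllib.request.Request(
        f"{base_url}/generate", data=body,
        headers={"Content-Type": "application/json"}, method="POST",
    )
    with urllib.request.urlopen(req, timeout=120) as resp:
        return json.loads(resp.read())["text"]


def compare_first_two(call: Callable[[], Any]) -> tuple[float, float]:
    """Time two consecutive invocations of the same call."""
    _, first = timed(call)
    _, second = timed(call)
    return first, second


# With the server running:
#   first, second = compare_first_two(lambda: post_generate("Capital of France?"))
#   print(f"first: {first:.2f}s, second: {second:.2f}s")
```

Restart the server before measuring so that the "first" call really is the first request after startup.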