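Multiple models, batch inference, and memory sharing

Two patterns that look similar but aren't

Once you move past the single-model, single-request demo, two production patterns show up. They look related, but they optimize for different things, and using the wrong one wastes resources.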

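Multiple models in one process: the service hosts several different models (for example, a 7B model for chat + a small embedder + a VLM for image understanding). Memory is shared at the process level; you save the per-Python-process overhead, but the models compete for unified memory.
Batched inference for one model: the service receives many concurrent prompts for the same model and processes them as a batch. Throughput goes up; per-request latency goes up too; what you gain in exchange is total compute efficiency.

This lesson covers both patterns and when each one is the right call.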

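Multiple models: memory accounting

Hosting two models in one process means both sets of weights live in unified memory at the same time. The per-model napkin math from foundations.lesson4 simply adds up. 7B Q4 (~5 GB) + small embedding model (~1 GB) + 2B VLM (~2 GB) = ~8 GB resident, plus the KV cache of whichever model is currently generating. Accounting matters more here than in the single-model case: you cannot afford any one model suddenly needing its full inference budget.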
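The addition above can be written down as a small budget check. This is a minimal sketch: the model sizes are the rough figures from the paragraph, while the 16 GB machine size, the 2 GB KV-cache allowance, and the 4 GB reserved for macOS and other apps are illustrative assumptions, not measurements.

```python
# Napkin-math memory budget for co-hosted models.
# Model sizes are the rough figures from the text; total_gb, kv_cache_gb and
# os_reserve_gb are illustrative assumptions, not measured values.

RESIDENT_GB = {
    "chat-7b-q4": 5.0,
    "embedder": 1.0,
    "vlm-2b": 2.0,
}

def memory_budget(resident_gb, total_gb=16.0, kv_cache_gb=2.0, os_reserve_gb=4.0):
    """Return (weights_gb, headroom_gb) for a set of resident models."""
    weights = sum(resident_gb.values())
    # Only one model generates at a time here, so a single KV-cache allowance.
    headroom = total_gb - os_reserve_gb - weights - kv_cache_gb
    return weights, headroom

weights, headroom = memory_budget(RESIDENT_GB)
print(f"resident weights: ~{weights:.0f} GB, headroom: ~{headroom:.0f} GB")
```

Under these assumed numbers the headroom is small, which is the point: on a machine like this, adding one more resident model or a long context leaves little slack.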

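The pattern that works: load every model at startup, keep them resident, and route each request to the right model based on request type. The pattern that doesn't work: load on demand, unload on idle. Model load latency (often measured in seconds for large models) dwarfs the inference latency of any individual request.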
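The load-once, route-by-type idea can be sketched without any ML dependency. Everything named here is hypothetical: the `ResidentRouter` class, the request kinds, and the stand-in `fake_load` function, which you would replace with a real loader such as `mlx_lm.load`.

```python
# Route by request type to models that are loaded once and kept resident.
# ResidentRouter, the request kinds, and fake_load are hypothetical names
# used for illustration only.

class ResidentRouter:
    def __init__(self, load_fn, routes):
        # routes maps request kind -> model identifier.
        self.routes = dict(routes)
        # Load each distinct model exactly once, up front.
        self.models = {mid: load_fn(mid) for mid in set(self.routes.values())}

    def model_for(self, kind):
        """Return the resident model for a request kind; never loads lazily."""
        try:
            return self.models[self.routes[kind]]
        except KeyError:
            raise ValueError(f"unknown request kind: {kind!r}") from None

load_calls = []

def fake_load(model_id):
    # Stand-in for an expensive load; records how often it is called.
    load_calls.append(model_id)
    return f"<model {model_id}>"

router = ResidentRouter(
    fake_load,
    {"chat": "big", "summarize": "big", "classify": "small"},
)
```

Two request kinds can share one resident model, and no lookup ever triggers a load, so the load cost is paid at startup rather than on a user's request.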

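Batched inference: when throughput beats latency

When you have many concurrent prompts for the same model, batching processes more tokens per GPU dispatch and meaningfully improves throughput-per-dollar. The trade is per-request latency: each request waits for the batch window to close before it sees its first output token.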
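To make the latency cost concrete, here is a minimal sketch of how long each request waits for its batch window to close. It assumes one particular policy (a window opens at the first arrival that is not already in a batch and closes `window_ms` later); the function name and the arrival times are made up for illustration.

```python
# Per-request wait caused by a fixed batch window.
# Assumed policy: a window opens at the first arrival not already in a batch
# and closes window_ms later; everything inside it is processed together.

def batch_waits(arrivals_ms, window_ms):
    """Return the wait (ms) until window close, in arrival order."""
    waits = []
    window_end = None
    for t in sorted(arrivals_ms):
        if window_end is None or t >= window_end:
            window_end = t + window_ms  # this arrival opens a new window
        waits.append(window_end - t)
    return waits

waits = batch_waits([0, 5, 20, 30], window_ms=25)
print(waits)  # -> [25, 20, 5, 25]
```

The request that opens a window always pays the full window before inference even starts, which is why a window that helps throughput under load is pure overhead for a lone request.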

mlx-lm does not have a first-class batched-generation primitive the way some PyTorch serving stacks do, so the practical pattern is async request collection at the service layer (FastAPI), followed by single-prompt inference per request through mlx-lm. For workloads dominated by many concurrent short prompts, look at a higher-level serving stack built on MLX (or wait for mlx-lm's batching API to mature).

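Decision tree

Single model, single user: the straightforward FastAPI service from prod.lesson1; no batching needed.
Multiple models, one process, low concurrency: load everything at startup, route by request type, and accept that you are spending unified memory on availability.
Single model, very high concurrency: explore batched serving (Ollama or a custom batching wrapper); accept higher per-request latency in exchange for higher throughput.
Multiple models and high concurrency: usually a signal that you have outgrown single-Mac serving; evaluate whether a fleet of single-purpose Macs is more economical.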
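The four branches above can be encoded as a small function. This is only a sketch: reducing concurrency to a single boolean is a simplifying assumption, and the function name and return strings are invented for illustration.

```python
# The decision tree above as a function.
# Assumption: concurrency is collapsed to a yes/no flag; where the threshold
# sits for your workload is something you have to measure.

PLAIN_SERVICE = "plain FastAPI service, no batching"
RESIDENT_MULTI = "load all models at startup, route by request type"
BATCHED = "explore batched serving"
FLEET = "evaluate a fleet of single-purpose Macs"

def choose_pattern(num_models: int, high_concurrency: bool) -> str:
    if num_models < 1:
        raise ValueError("num_models must be at least 1")
    if num_models == 1:
        return BATCHED if high_concurrency else PLAIN_SERVICE
    return FLEET if high_concurrency else RESIDENT_MULTI

print(choose_pattern(3, high_concurrency=False))
```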

Code

Multiple models in one FastAPI process·python

# Sketch of a multi-model service (extends prod.lesson1's pattern).
from contextlib import asynccontextmanager
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from mlx_lm import load, generate

# Registry of resident models and tokenizers, filled once at startup.
models: dict = {}

@asynccontextmanager
async def lifespan(app: FastAPI):
    print("Loading models...")
    models["chat"], models["chat_tok"] = load("mlx-community/Mistral-7B-Instruct-v0.3-4bit")
    models["small"], models["small_tok"] = load("mlx-community/Llama-3.2-1B-Instruct-4bit")
    # Warm up each model so the first real request doesn't pay one-time setup cost.
    for name in ["chat", "small"]:
        _ = generate(models[name], models[f"{name}_tok"], prompt="warmup", max_tokens=1)
    print("All models loaded and warm.")
    yield
    # Drop references on shutdown so the weights can be freed.
    models.clear()

app = FastAPI(lifespan=lifespan)

class Req(BaseModel):
    model: str        # "chat" or "small"
    prompt: str
    max_tokens: int = 100

@app.post("/generate")
async def gen(req: Req):
    if req.model not in ("chat", "small"):
        # Signal a client error instead of returning 200 with an error body.
        raise HTTPException(status_code=400, detail="unknown model")
    # generate() is synchronous: it blocks the event loop while it runs, so
    # requests are served one at a time. Acceptable at low concurrency.
    text = generate(
        models[req.model], models[f"{req.model}_tok"],
        prompt=req.prompt, max_tokens=req.max_tokens, verbose=False,
    )
    return {"text": text, "model": req.model}

Async batch collection (lightweight pattern)·python

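# A thin batching pattern: collect prompts that arrive within a short window,
# process serially via mlx-lm. Real batched generation is more involved;
# this approximation is good enough when you control your traffic shape.
import asyncio
from collections import deque
from typing import Callable

class BatchCollector:
    def __init__(self, infer: Callable[[str], str], window_ms: int = 25):
        self.infer = infer  # hypothetical blocking wrapper around mlx-lm generate()
        self.window = window_ms / 1000.0
        self.queue: deque = deque()
        self.lock = asyncio.Lock()
        self.draining = False

    async def submit(self, prompt: str) -> str:
        future: asyncio.Future = asyncio.get_running_loop().create_future()
        async with self.lock:
            self.queue.append((prompt, future))
            first = not self.draining  # the first arrival opens the window
            self.draining = True
        if first:
            await asyncio.sleep(self.window)  # let more prompts arrive
            await self._drain()
        return await future

    async def _drain(self):
        # Minimum sketch: serial inference, resolving each caller's future.
        while True:
            async with self.lock:
                if not self.queue:
                    self.draining = False
                    return
                prompt, future = self.queue.popleft()
            try:
                future.set_result(self.infer(prompt))
            except Exception as exc:
                future.set_exception(exc)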

Exercise

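Take this lesson's multi-model FastAPI sketch and run it locally. Hit it with two different prompts, one routed to chat and the other to small. Note the per-request latency of each; the smaller model should be noticeably faster. Then check memory usage (Activity Monitor or vm_stat); both models should be resident. Write two sentences on what the multi-model overhead costs you and whether it is worth it for your workflow.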
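One way to take the latency measurements is a small standard-library client. This is a sketch under assumptions: the service is taken to be listening at `http://127.0.0.1:8000` (the address is a guess, use whatever your server prints), and `build_request`, `timed_generate`, and `main` are hypothetical helper names.

```python
# Minimal timing client for the /generate endpoint, standard library only.
# URL is an assumed local address; the helper names are hypothetical.
import json
import time
import urllib.request

URL = "http://127.0.0.1:8000/generate"

def build_request(model, prompt, max_tokens=50, url=URL):
    """Build a JSON POST matching the Req schema (model, prompt, max_tokens)."""
    body = json.dumps(
        {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    ).encode("utf-8")
    return urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}, method="POST"
    )

def timed_generate(model, prompt, opener=urllib.request.urlopen):
    """Send one request; return (parsed JSON response, wall-clock seconds)."""
    start = time.perf_counter()
    with opener(build_request(model, prompt)) as resp:
        payload = json.load(resp)
    return payload, time.perf_counter() - start

def main():
    for model in ("chat", "small"):
        _, seconds = timed_generate(model, "Explain unified memory in one sentence.")
        print(f"{model}: {seconds:.2f}s")
```

Call `main()` with the server running. The timing is wall-clock and includes HTTP overhead, so compare the two models against each other rather than reading either number as pure inference time.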