Serving — FastAPI, TorchServe, vLLM

From .pth to an HTTP endpoint

Three flavors of serving, in order of increasing operational sophistication:

FastAPI + uvicorn: write your own server. Maximum flexibility, minimum infrastructure. Great for prototypes, internal tools, and small production deployments.
TorchServe: a managed inference server (AWS + Meta). Built-in batching, model versioning, metrics, and multi-model serving. The right choice when you have many models in production.
vLLM / TGI / SGLang: LLM-specific serving. Continuous batching, PagedAttention, optimized kernels. The right choice for serving language models.

The basics every serving setup needs

Load the model once at startup, not per request.
Call model.eval() once, and run every request under torch.inference_mode().
Batch requests when the latency budget allows; throughput improves substantially.
Avoid CPU-GPU syncs inside a request: pre-allocate buffers and use non_blocking transfers.
Expose a health endpoint so the orchestrator can tell whether the service is alive.
Expose a metrics endpoint (latency, throughput, error rate, GPU utilization).
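The metrics item above can be sketched with the standard library alone. This is a minimal illustration, not a recommended production setup: the class name `RequestMetrics` and its fields are hypothetical, it keeps every latency in memory, and it omits GPU utilization. A real service would more likely export these numbers through a dedicated metrics library.

```python
import math
import time
from contextlib import contextmanager


class RequestMetrics:
    """Hypothetical in-process tracker for latency and error rate."""

    def __init__(self):
        self.latencies_s = []   # one entry per handled request
        self.errors = 0

    @contextmanager
    def track(self):
        # Time the wrapped block; count it as an error if it raises.
        t0 = time.perf_counter()
        try:
            yield
        except Exception:
            self.errors += 1
            raise
        finally:
            self.latencies_s.append(time.perf_counter() - t0)

    def percentile(self, q):
        # Nearest-rank percentile over all recorded latencies.
        if not self.latencies_s:
            return 0.0
        ordered = sorted(self.latencies_s)
        rank = max(1, math.ceil(q / 100 * len(ordered)))
        return ordered[rank - 1]

    def snapshot(self):
        # A dict like this could be returned from a /metrics handler.
        n = len(self.latencies_s)
        return {
            "requests": n,
            "error_rate": self.errors / n if n else 0.0,
            "p50_s": self.percentile(50),
            "p95_s": self.percentile(95),
        }
```

In a request handler, the model call would sit inside `with metrics.track():`, and a metrics endpoint would return `metrics.snapshot()`.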

Code

FastAPI: the simplest HTTP server·python

# pip install fastapi uvicorn[standard]
from fastapi import FastAPI
from pydantic import BaseModel
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

app = FastAPI()

# Load once at startup
model_name = "distilbert-base-uncased-finetuned-sst-2-english"
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
labels = ['NEGATIVE', 'POSITIVE']

class Req(BaseModel):
    text: str

class Resp(BaseModel):
    label: str
    score: float

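@app.post("/predict", response_model=Resp)
def predict(req: Req) -> Resp:
    # Plain `def`: FastAPI runs sync handlers in a threadpool, so the blocking
    # forward pass does not stall the event loop. inference_mode wraps the
    # tokenization and forward pass explicitly.
    with torch.inference_mode():
        inputs = tok(req.text, return_tensors='pt', truncation=True, max_length=512).to(device)
        logits = model(**inputs).logits
        probs = logits.softmax(-1)[0]
        idx = int(probs.argmax())
    return Resp(label=labels[idx], score=float(probs[idx]))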

@app.get('/health')
async def health(): return {'status': 'ok'}

# Run: uvicorn app:app --host 0.0.0.0 --port 8000

TorchServe: the production-grade option·python

# Bash, not Python — TorchServe is a separate process

# 1. Package model
# torch-model-archiver \
#     --model-name resnet50 \
#     --version 1.0 \
#     --model-file resnet_model.py \
#     --serialized-file resnet50.pth \
#     --handler image_classifier \
#     --export-path model_store \
#     --force

# 2. Start server
# torchserve --start --model-store model_store --models resnet50=resnet50.mar

# 3. Call API
# curl -X POST http://localhost:8080/predictions/resnet50 -T image.jpg

# Built-ins: batching, scaling, metrics, model versioning, multi-model
# Drawback: more setup vs FastAPI; opinionated about the handler interface
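TorchServe's built-in batching is configured per model when the model is registered. The sketch below only builds the registration URL for the management API. It assumes the default management port 8081 and that the model is registered through this API rather than with `--models` at startup. The helper name `register_url` and the batch values are illustrative.

```python
from urllib.parse import urlencode


def register_url(mar_file, batch_size=8, max_batch_delay_ms=50,
                 initial_workers=1, management_api="http://localhost:8081"):
    """Build a management-API URL that registers a model with batching enabled."""
    query = urlencode({
        "url": mar_file,                        # archive inside the model store
        "batch_size": batch_size,               # max requests merged into one batch
        "max_batch_delay": max_batch_delay_ms,  # ms to wait for the batch to fill
        "initial_workers": initial_workers,
    })
    return f"{management_api}/models?{query}"


# Usage (assumes a running TorchServe instance):
#   curl -X POST "$(python -c 'print(register_url("resnet50.mar"))')"
```

The handler then receives a list of requests per call, so a custom handler has to return one result per request in the same order.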

vLLM: the LLM-shaped option·python

# pip install vllm
# vLLM is the de facto LLM serving runtime — continuous batching, PagedAttention,
# OpenAI-compatible API.

# Start a server (CLI)
# python -m vllm.entrypoints.openai.api_server \
#     --model meta-llama/Llama-3.2-3B-Instruct \
#     --host 0.0.0.0 --port 8000

# Or programmatically
from vllm import LLM, SamplingParams

llm = LLM(model="gpt2")
sampling = SamplingParams(temperature=0.7, max_tokens=64)

prompts = ["Once upon a time", "The capital of France is"]
outputs = llm.generate(prompts, sampling)

for output in outputs:
    print(output.prompt, '→', output.outputs[0].text)
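Because the server started from the CLI speaks the OpenAI-compatible API, a client needs nothing vLLM-specific. Here is a minimal standard-library sketch of a call to the `/v1/completions` route. It assumes the server from the CLI example is listening on localhost:8000 with that model name. The helper names `build_completion_request` and `complete` are hypothetical.

```python
import json
import urllib.request


def build_completion_request(prompt, model="meta-llama/Llama-3.2-3B-Instruct",
                             base_url="http://localhost:8000",
                             max_tokens=64, temperature=0.7):
    """Build (but do not send) a POST request for the completions route."""
    payload = {
        "model": model,            # must match the model the server was started with
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt, **kwargs):
    """Send the request and return the generated text of the first choice."""
    req = build_completion_request(prompt, **kwargs)
    with urllib.request.urlopen(req, timeout=60) as resp:
        body = json.loads(resp.read())
    return body["choices"][0]["text"]


# Usage (requires a running server): print(complete("The capital of France is"))
```

Any OpenAI-style client library pointed at the same base URL should work equally well; the raw request is shown only to make the wire format visible.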

Naive request batching: when the latency budget allows·python

import asyncio
import torch
from fastapi import FastAPI

app = FastAPI()
queue = asyncio.Queue()

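def run_model(inputs):
    # Assumes `model` was loaded once at startup and maps a list of inputs
    # to one output per input.
    with torch.inference_mode():
        return model(inputs)              # batched call

async def batched_worker():
    loop = asyncio.get_running_loop()
    while True:
        # Pull up to 32 requests OR wait 20 ms in total, whichever comes first
        first = await queue.get()
        batch = [first]
        deadline = loop.time() + 0.02
        while len(batch) < 32:
            remaining = deadline - loop.time()
            if remaining <= 0:
                break
            try:
                batch.append(await asyncio.wait_for(queue.get(), timeout=remaining))
            except asyncio.TimeoutError:
                break

        # Run model.forward on the whole batch, off the event loop thread
        inputs = [b['input'] for b in batch]
        try:
            outputs = await asyncio.to_thread(run_model, inputs)
        except Exception as exc:
            # Fail the waiting requests instead of leaving their futures pending
            for b in batch:
                if not b['future'].done():
                    b['future'].set_exception(exc)
            continue

        # Resolve each request's future
        for b, out in zip(batch, outputs):
            if not b['future'].done():
                b['future'].set_result(out)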

@app.on_event("startup")
async def start_worker():
    # Keep a reference so the worker task is not garbage-collected
    app.state.batch_worker = asyncio.create_task(batched_worker())

@app.post("/predict")
async def predict(text: str):
    fut = asyncio.get_running_loop().create_future()
    await queue.put({'input': text, 'future': fut})
    return await fut
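To watch the queue-and-future pattern group requests without a GPU or a web server, the sketch below runs the same idea against a stand-in model. Everything here is illustrative: `fake_model` is a hypothetical placeholder that doubles its inputs, and the batch size and wait time are arbitrary.

```python
import asyncio


async def batch_worker(queue, model_fn, batch_sizes, max_batch=4, max_wait_s=0.1):
    """Group queued requests into batches and resolve each request's future."""
    loop = asyncio.get_running_loop()
    while True:
        batch = [await queue.get()]
        deadline = loop.time() + max_wait_s   # one shared deadline per batch
        while len(batch) < max_batch:
            remaining = deadline - loop.time()
            if remaining <= 0:
                break
            try:
                batch.append(await asyncio.wait_for(queue.get(), timeout=remaining))
            except asyncio.TimeoutError:
                break
        batch_sizes.append(len(batch))
        outputs = model_fn([item for item, _ in batch])   # one batched call
        for (_, fut), out in zip(batch, outputs):
            fut.set_result(out)


async def submit(queue, item):
    # What a request handler does: enqueue, then wait for its own result.
    fut = asyncio.get_running_loop().create_future()
    await queue.put((item, fut))
    return await fut


async def demo(n_requests=5):
    queue = asyncio.Queue()
    batch_sizes = []
    fake_model = lambda xs: [x * 2 for x in xs]   # hypothetical stand-in model
    worker = asyncio.create_task(batch_worker(queue, fake_model, batch_sizes))
    results = await asyncio.gather(*(submit(queue, i) for i in range(n_requests)))
    worker.cancel()
    return batch_sizes, results


# Usage: batch_sizes, results = asyncio.run(demo())
```

With five concurrent requests and `max_batch=4`, the worker should normally form one full batch and then a second, smaller one after the wait expires. Each caller still receives only its own result.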

Exercise

Wrap any small classifier (HF distilbert, your own TinyMLP, anything) in a FastAPI endpoint. Run it with uvicorn. curl the /predict endpoint with a few inputs and verify the responses. Add a /health endpoint. Time how much longer the first request takes than subsequent ones to see the cold-start cost.
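For the timing step, a small standard-library helper is enough. The sketch below is one possible approach, assuming the service listens on localhost:8000 and accepts a JSON body with a `text` field as in the FastAPI example. The function names are hypothetical.

```python
import json
import time
import urllib.request


def time_calls(send, n=5):
    """Call send() n times and return each call's latency in seconds."""
    latencies = []
    for _ in range(n):
        t0 = time.perf_counter()
        send()
        latencies.append(time.perf_counter() - t0)
    return latencies


def cold_vs_warm(latencies):
    """Return (first-call latency, mean latency of the remaining calls)."""
    first, rest = latencies[0], latencies[1:]
    warm = sum(rest) / len(rest) if rest else 0.0
    return first, warm


def post_predict(url="http://localhost:8000/predict", text="I loved it"):
    # Assumed request shape: {"text": ...}, matching the Req model above.
    req = urllib.request.Request(
        url,
        data=json.dumps({"text": text}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.loads(resp.read())


# Usage (requires the server to be running):
#   first, warm = cold_vs_warm(time_calls(post_predict))
#   print(f"first: {first:.3f}s, warm mean: {warm:.3f}s")
```

Run it right after starting the server; restarting the server between runs is what makes the first call cold again.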