API Wrapping & Monitoring

~20 min · fastapi, monitoring, drift, sli

FastAPI in front

vLLM and TGI both expose an API, but you usually put a thin app layer in front of them for auth, rate limiting, request logging, output validation, and business logic. FastAPI is the standard choice.
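As an illustration of one job of that thin layer, here is a minimal sketch of per-client rate limiting with a token bucket. It is framework-agnostic and in-memory only; the class names `TokenBucket` and `RateLimiter`, and the idea of keying on an API key, are assumptions for this example and not part of FastAPI, vLLM, or TGI.

```python
import time
from typing import Callable, Dict


class TokenBucket:
    """Allow bursts of up to `capacity` requests, refilled at `rate` tokens per second."""

    def __init__(self, capacity: float, rate: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.capacity = capacity
        self.rate = rate
        self.clock = clock
        self.tokens = capacity          # start with a full bucket
        self.updated = clock()

    def allow(self) -> bool:
        now = self.clock()
        # Refill in proportion to elapsed time, never above capacity.
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class RateLimiter:
    """One bucket per API key (hypothetical keying scheme), held in process memory."""

    def __init__(self, capacity: float, rate: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.capacity = capacity
        self.rate = rate
        self.clock = clock
        self._buckets: Dict[str, TokenBucket] = {}

    def allow(self, api_key: str) -> bool:
        bucket = self._buckets.get(api_key)
        if bucket is None:
            bucket = TokenBucket(self.capacity, self.rate, self.clock)
            self._buckets[api_key] = bucket
        return bucket.allow()
```

In a FastAPI app, a check like `limiter.allow(api_key)` would typically run in a dependency and raise `HTTPException(status_code=429)` when it returns `False`. With more than one worker process, the counters would need to live in shared storage instead of a dict.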

Five things to monitor in production

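Latency: P50, P95, and P99 response times. Watch for regressions after a model swap.
Quality: sample outputs and run them through an LLM-as-judge on a schedule. This catches silent quality drift.
Drift: compare the current input distribution against a baseline. Have users started asking different questions?
Errors: log and alert on malformed outputs, refusals, hallucinations, and schema validation failures.
Cost: GPU utilization, requests per dollar, and cost per customer. If you can't see the dollar curve, the flywheel is broken.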
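The drift item is the least obvious to implement, so here is one possible sketch: a Population Stability Index (PSI) over prompt lengths. PSI is a technique chosen for this example, not something the lesson prescribes; the function name `psi`, the bucket `edges`, and the choice of prompt length as the feature are all assumptions. Embedding-based comparisons would catch topic shifts that length alone misses.

```python
import math
from bisect import bisect_right
from typing import List, Sequence


def psi(baseline: Sequence[float], current: Sequence[float],
        edges: Sequence[float], eps: float = 1e-6) -> float:
    """Population Stability Index between two samples.

    `edges` are sorted bucket boundaries (an assumption of this sketch);
    values fall into len(edges) + 1 buckets.
    """
    if not baseline or not current:
        raise ValueError("both samples must be non-empty")

    def shares(values: Sequence[float]) -> List[float]:
        counts = [0] * (len(edges) + 1)
        for v in values:
            counts[bisect_right(edges, v)] += 1
        # Floor each share at eps so empty buckets do not produce log(0).
        return [max(c / len(values), eps) for c in counts]

    p = shares(baseline)
    q = shares(current)
    return sum((qi - pi) * math.log(qi / pi) for pi, qi in zip(p, q))


# Hypothetical data: prompt lengths in characters.
baseline_lengths = [20] * 80 + [300] * 20   # mostly short prompts at launch
current_lengths = [20] * 20 + [300] * 80    # mostly long prompts this week
score = psi(baseline_lengths, current_lengths, edges=[50, 200])
```

A commonly quoted rule of thumb treats PSI below 0.1 as stable and above 0.25 as a shift worth investigating, but those cutoffs are conventions, so it is worth calibrating them against your own traffic.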

Code

FastAPI wrapper around vLLM with structured response·python

import logging
import time

from fastapi import FastAPI, HTTPException
from openai import AsyncOpenAI
from pydantic import BaseModel

app = FastAPI()
log = logging.getLogger("my-ft-app")
# vLLM serves an OpenAI-compatible API; the key is a placeholder
# unless the server was configured to require one.
client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="none")

class ChatRequest(BaseModel):
    message: str
    system_prompt: str = "You are a helpful assistant."

class ChatResponse(BaseModel):
    response: str
    model: str
    latency_ms: int

@app.post("/chat", response_model=ChatResponse)
async def chat(req: ChatRequest):
    start = time.perf_counter()
    try:
        # Await the async client so generation does not block the event loop.
        result = await client.chat.completions.create(
            model="my-finetuned-model",
            messages=[
                {"role": "system", "content": req.system_prompt},
                {"role": "user",   "content": req.message},
            ],
            temperature=0.7,
            max_tokens=1024,
        )
    except Exception as e:
        log.exception("upstream failure")
        raise HTTPException(status_code=502, detail=str(e)) from e
    latency_ms = int((time.perf_counter() - start) * 1000)
    # content can be None, so normalize it before measuring and returning it.
    text = result.choices[0].message.content or ""
    log.info("chat ok", extra={"latency_ms": latency_ms,
                               "prompt_len": len(req.message),
                               "response_len": len(text)})
    return ChatResponse(
        response=text,
        model=result.model,
        latency_ms=latency_ms,
    )
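One caveat about the `log.info(..., extra={...})` call above: with Python's default logging setup, fields passed through `extra=` are attached to the record but never printed, and INFO messages are dropped unless something configures the logger. A minimal sketch of a JSON formatter that emits those fields, using only the standard library, might look like this. `JsonFormatter` and `configure_logging` are names made up for this example.

```python
import json
import logging

# Attribute names found on every LogRecord; anything else arrived via `extra=`.
_STANDARD_ATTRS = set(vars(logging.makeLogRecord({}))) | {"message", "asctime"}


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line, including `extra` fields."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "msg": record.getMessage(),
        }
        for key, value in vars(record).items():
            if key not in _STANDARD_ATTRS:
                payload[key] = value
        if record.exc_info:
            payload["exc"] = self.formatException(record.exc_info)
        # default=str keeps the formatter from failing on non-JSON types.
        return json.dumps(payload, default=str)


def configure_logging(level: int = logging.INFO) -> None:
    """Send all logs to stderr as JSON lines (replaces existing root handlers)."""
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    root = logging.getLogger()
    root.handlers = [handler]
    root.setLevel(level)
```

Calling `configure_logging()` once at startup, before the first request, should be enough for the wrapper's `latency_ms`, `prompt_len`, and `response_len` fields to show up as top-level JSON keys.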

Exercise

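Stand up the FastAPI wrapper above in front of a local vLLM instance. Add structured logging for three fields (latency, prompt_len, response_len). Hit it with 100 mixed prompts, then grep the logs and compute P50/P95 latency and the median response length. This is the minimum operational floor to have in place after launch.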
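For the last step, a possible starting point is sketched below. It assumes each log line is a JSON object carrying `latency_ms` and `response_len` keys, which depends on how you set up structured logging; the helper names `percentile` and `summarize` are made up for this example. It uses the nearest-rank percentile definition, which is one of several in use.

```python
import json
import math
import statistics
from typing import Dict, Iterable, List, Sequence


def percentile(values: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest value with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(lines: Iterable[str]) -> Dict[str, float]:
    """Compute latency percentiles and median response length from JSON log lines."""
    latencies: List[float] = []
    lengths: List[float] = []
    for line in lines:
        try:
            rec = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip non-JSON lines such as server startup messages
        if not isinstance(rec, dict):
            continue
        if "latency_ms" not in rec or "response_len" not in rec:
            continue
        latencies.append(rec["latency_ms"])
        lengths.append(rec["response_len"])
    return {
        "n": len(latencies),
        "p50_ms": percentile(latencies, 50),
        "p95_ms": percentile(latencies, 95),
        "median_response_len": statistics.median(lengths),
    }
```

Usage would be along the lines of `summarize(open("app.log"))`, where `app.log` is a placeholder for wherever your logs end up.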