왜 전용 인퍼런스 서버

pipeline() 은 프로토타이핑용

pipeline() 콜이 요청마다 re-tokenize, 요청마다 allocate, KV cache 요청 간 공유 X, batching 컨트롤 X. 사용자 한 명엔 OK. 동시 사용자 열 명엔 단일 GPU 두고 싸우는 pipeline 객체 stack.

인퍼런스 서버가 추가하는 거

Continuous batching — 새 요청이 매 step in-flight batch 에 합류, 이전 거 끝나기 기다리지 않음.
PagedAttention / KV cache 관리 — 엔진이 attention tensor page-fault, GPU 메모리에 짧은 대화 많이 들어감.
Streaming — 첫 서버 응답부터 SSE 또는 token-by-token.
Serve 타임 양자화 — AWQ / GPTQ / bnb / fp8 / int4 startup 시 선택 가능.
Health, metrics, model 관리 — /health, /metrics, /info 엔드포인트 baked in.

스택 둘: TGI 와 vLLM

HF first-party 서버는 text-generation-inference (TGI). 커뮤니티 표준은 vLLM. 겹쳐; 둘 다 알 가치. TGI 먼저 다룰게 — Hub 와 OpenAI 호환 router 와 자연스럽게 compose.

Code

비교: 동시 사용자 10 명에 pipeline vs TGI·bash

# Pipeline 접근: Python 프로세스 하나, 순차
# 단일 GPU 7B 모델 throughput top out 보통 2-5 req/s

# TGI 접근: TGI 프로세스 하나, continuous batching
# 같은 하드웨어 같은 워크로드 throughput: 30-100 req/s

# 예시지 prescriptive 아님 — 너 하드웨어에서 측정해.

httpx 로 concurrency spot 테스트·python

import httpx
import asyncio

async def hit(client, prompt):
    r = await client.post(
        "http://localhost:8080/generate",
        json={"inputs": prompt, "parameters": {"max_new_tokens": 50}},
        timeout=60,
    )
    return r.json()

async def main():
    async with httpx.AsyncClient() as client:
        results = await asyncio.gather(*[hit(client, "ping") for _ in range(20)])
        print(len(results), results[0])

asyncio.run(main())

pipeline() 은 프로토타이핑용

인퍼런스 서버가 추가하는 거

스택 둘: TGI 와 vLLM

Code

External links

Exercise

Progress

댓글 0