vLLM and throughput

~22 min · serving, vllm, throughput

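What vLLM does

vLLM is built around continuous batching and PagedAttention. The key idea: when many requests arrive at the same time, vLLM packs them into batches so that GPU compute and memory are shared across requests. Under concurrent load this gives substantially higher throughput than an engine that handles one request at a time.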
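To see why continuous batching helps, here is a toy step-count model. It is only a sketch, not vLLM's scheduler: it assumes every request needs one decode step per output token, ignores prefill and memory limits, and the function names and request lengths are made up for illustration.

```python
def static_batching_steps(lengths, batch_size):
    """Decode steps when each batch runs until its longest request finishes."""
    steps = 0
    for start in range(0, len(lengths), batch_size):
        steps += max(lengths[start:start + batch_size])
    return steps


def continuous_batching_steps(lengths, batch_size):
    """Decode steps when a finished request's slot is refilled before the next step."""
    waiting = list(lengths)  # FIFO queue of remaining-token counts
    running = []
    steps = 0
    while waiting or running:
        # Admit waiting requests into any free slots.
        while waiting and len(running) < batch_size:
            running.append(waiting.pop(0))
        steps += 1
        # Every running request emits one token; drop the ones that are done.
        running = [r - 1 for r in running if r > 1]
    return steps


# Hypothetical workload: two long answers mixed with six short ones.
lengths = [80, 5, 5, 5, 80, 5, 5, 5]
print("static:    ", static_batching_steps(lengths, batch_size=4))
print("continuous:", continuous_batching_steps(lengths, batch_size=4))
```

In the static version the short requests sit idle in their slots until the long one finishes. In the continuous version those slots are handed to the next waiting request right away, so the same work takes fewer steps.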

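When to reach for vLLM

You serve an internal API and multiple users hit the same model.
You run batch jobs over thousands of prompts.
You have multi-GPU NVIDIA hardware and want inference sharded across the cards.

For a single user on a Mac, vLLM is overkill. Ollama is the cleaner choice.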

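What you give up

Partial Apple Silicon support. vLLM is NVIDIA-first. Some paths do run on a Mac (CPU-only, slow), but you would not pick vLLM on a Mac for production.
Heavier setup. Python deps, CUDA versions, model paths: more knobs than Ollama.
A different memory model. vLLM pre-allocates a large KV-cache pool at startup, which PagedAttention then hands out in fixed-size blocks. The trade-off is fast scheduling, but you have to size the pool correctly.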
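Sizing that pool is mostly arithmetic. The sketch below estimates KV-cache bytes per token and how many tokens fit in a given memory budget. The model shape (28 layers, 4 KV heads, head dimension 128, fp16), the 24 GiB card, and the 15 GiB weight footprint are assumptions for illustration, so check your model's config.json. It also ignores activations and runtime overhead, so treat the result as an upper bound. In vLLM the budget fraction corresponds to `--gpu-memory-utilization`, and `--max-model-len` caps the tokens per request.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    """Bytes of KV cache one token occupies across all layers."""
    # Factor 2: one key vector and one value vector per layer and KV head.
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def max_cached_tokens(gpu_bytes, weight_bytes, utilization, per_token_bytes):
    """Rough upper bound on tokens the KV-cache pool can hold."""
    budget = gpu_bytes * utilization - weight_bytes
    return max(0, int(budget // per_token_bytes))


GIB = 2**30

# Assumed shape for a 7B-class model with grouped-query attention.
per_token = kv_bytes_per_token(num_layers=28, num_kv_heads=4, head_dim=128)
print(f"KV cache per token: {per_token / 1024:.0f} KiB")

# Assumed hardware: 24 GiB card, ~15 GiB of fp16 weights, 90% utilization.
tokens = max_cached_tokens(24 * GIB, 15 * GIB, 0.9, per_token)
print(f"Approx. tokens that fit: {tokens}")
print(f"Approx. concurrent 8192-token requests: {tokens // 8192}")
```

If the last number is smaller than the concurrency you expect, you would lower the maximum context length, use a smaller or quantized model, or add GPUs.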

Code

Run vLLM with a single Hugging Face model (NVIDIA host)·bash

# Install vLLM (assumes CUDA 12.x and an NVIDIA GPU)
pip install vllm

# Serve a Hugging Face model on port 8000 through the OpenAI-compatible API
python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen2.5-7B-Instruct \
  --host 0.0.0.0 \
  --port 8000 \
  --max-model-len 8192

# Test
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-7B-Instruct",
    "messages": [{"role":"user","content":"Hello!"}],
    "stream": false
  }'

Throughput test: many concurrent requests·python

import asyncio, httpx, time

async def hit(client: httpx.AsyncClient, i: int):
    r = await client.post("http://localhost:8000/v1/chat/completions",
                          json={"model": "Qwen/Qwen2.5-7B-Instruct",
                                "messages": [{"role": "user",
                                              "content": f"Count to 10 — request #{i}"}],
                                "stream": False, "max_tokens": 80},
                          timeout=120.0)
    r.raise_for_status()  # fail loudly on HTTP errors instead of a KeyError below
    return r.json()["choices"][0]["message"]["content"]

async def main():
    async with httpx.AsyncClient() as c:
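        t0 = time.perf_counter()  # monotonic clock, better suited to timing intervals
        # Fire 50 requests concurrently
        results = await asyncio.gather(*(hit(c, i) for i in range(50)))
        dt = time.perf_counter() - t0
        print(f"{len(results)} requests in {dt:.1f}s → {len(results)/dt:.2f} req/s")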

asyncio.run(main())
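Requests per second hides how long each answer was, so it can help to report generated tokens per second as well. The helper below is a sketch, and `summarize` is a hypothetical name. It assumes you change `hit` to return the whole parsed body (`r.json()`) instead of just the message text, and that each response carries the OpenAI-style `usage.completion_tokens` field.

```python
def summarize(responses, elapsed_s):
    """Aggregate throughput from parsed /v1/chat/completions response bodies."""
    # Missing usage info counts as 0 tokens rather than raising.
    completion_tokens = sum(
        r.get("usage", {}).get("completion_tokens", 0) for r in responses
    )
    return {
        "requests": len(responses),
        "req_per_s": len(responses) / elapsed_s,
        "tokens_per_s": completion_tokens / elapsed_s,
    }


# Made-up response bodies, just to show the shape of the input.
fake = [
    {"usage": {"completion_tokens": 80}},
    {"usage": {"completion_tokens": 40}},
]
print(summarize(fake, elapsed_s=4.0))
```

Running the same summary against both engines makes the comparison fairer when one of them tends to stop generating earlier than the other.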

Exercise

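On NVIDIA hardware, start vLLM with one model and run the 50-request burst test. Then compare req/s against Ollama serving the same model. (If you have no NVIDIA hardware, write a 200-word note on which workloads would make you reach for vLLM and which engine you actually use today.)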
