TGI: Pull, Run, Tune

~30 min · serving, tgi

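5-minute startup

TGI ships as a Docker image. The standard recipe: pick a model, mount a volume to cache the weights, publish the server port (host port 8080 in the examples here), and set --model-id. On first start the container downloads the weights, which is slow; on later restarts it loads them from the mounted volume, which is fast.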
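
Because the first start can spend minutes downloading weights, it helps to wait for the server explicitly rather than guess. The sketch below polls GET /health until it succeeds. The function name wait_for_tgi, the default URL, and the timeout values are illustrative assumptions, not part of TGI.

```shell
# Poll TGI's /health endpoint until it returns a success status or a timeout expires.
# Arguments (all optional, all illustrative): base URL, timeout in seconds, poll interval in seconds.
wait_for_tgi() {
  local url="${1:-http://localhost:8080}"
  local timeout_s="${2:-600}"
  local interval_s="${3:-5}"
  local waited=0
  # curl -f makes HTTP errors (for example 503 while the model loads) count as failures.
  while ! curl -sf -o /dev/null "$url/health"; do
    if [ "$waited" -ge "$timeout_s" ]; then
      echo "TGI not healthy after ${waited}s" >&2
      return 1
    fi
    sleep "$interval_s"
    waited=$((waited + interval_s))
  done
  echo "TGI healthy after ~${waited}s"
}
```

Calling wait_for_tgi http://localhost:8080 900 once after a cold start and once after a restart gives a rough comparison of download time against warm-start time.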

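The knobs you'll actually tune

--model-id: a Hugging Face repo id or a local path.
--quantize: awq | gptq | bitsandbytes | bitsandbytes-nf4 | fp8. The method has to match the checkpoint: awq and gptq expect weights already quantized in that format, while the bitsandbytes variants quantize a regular checkpoint at load time.
--max-concurrent-requests: caps the number of in-flight requests; anything beyond the cap is rejected instead of queued. Raise it for throughput when the GPU has headroom, lower it to apply backpressure sooner.
--max-input-length, --max-total-tokens: per-request token limits (prompt only, and prompt plus generated tokens) that determine how much KV cache each request may claim. Higher values budget more GPU memory per request and leave room for fewer concurrent requests. Recent releases name the first flag --max-input-tokens and keep --max-input-length as a legacy alias.
--num-shard: the number of tensor-parallel shards, one per GPU.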
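
To make the memory trade-off behind --max-total-tokens concrete, here is a back-of-the-envelope calculation. It assumes the Llama-3.1-8B shape (32 layers, 8 KV heads, head dimension 128), a 16-bit KV cache, and a hypothetical --max-total-tokens of 4096. It is a worst-case estimate for one request, not a description of how TGI allocates memory internally.

```shell
# Per-token KV-cache size = 2 (K and V) * layers * kv_heads * head_dim * bytes_per_value
kv_bytes_per_token() {
  local layers="$1"
  local kv_heads="$2"
  local head_dim="$3"
  local bytes_per_value="$4"
  echo $((2 * layers * kv_heads * head_dim * bytes_per_value))
}

# Llama-3.1-8B: 32 layers, 8 KV heads (grouped-query attention), head_dim 128; fp16 = 2 bytes.
per_token=$(kv_bytes_per_token 32 8 128 2)
max_total_tokens=4096   # hypothetical value for --max-total-tokens
per_request_mib=$((per_token * max_total_tokens / 1024 / 1024))

echo "KV cache per token:   $((per_token / 1024)) KiB"
echo "KV cache per request: ${per_request_mib} MiB at ${max_total_tokens} tokens"
# Prints:
# KV cache per token:   128 KiB
# KV cache per request: 512 MiB at 4096 tokens
```

Under these assumptions, 16 requests that each use their full token budget would need about 8 GiB of KV cache on top of the weights, which is why raising the token limits reduces how many requests fit at once.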

Endpoints you get for free

POST /generate, POST /generate_stream, POST /v1/chat/completions (OpenAI-compatible), GET /info, GET /health, GET /metrics (Prometheus).
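
The native POST /generate endpoint takes a JSON body with an inputs string and an optional parameters object. A minimal sketch follows; the helper names generate_payload and tgi_generate and the TGI_URL variable are hypothetical, and the prompt is inserted verbatim, so it must not contain unescaped double quotes or backslashes.

```shell
# Build the JSON body for TGI's native /generate endpoint.
# $1 = prompt (inserted verbatim, no JSON escaping), $2 = max_new_tokens (default 32).
generate_payload() {
  local prompt="$1"
  local max_new_tokens="${2:-32}"
  printf '{"inputs":"%s","parameters":{"max_new_tokens":%d}}' "$prompt" "$max_new_tokens"
}

# Send the request. TGI_URL is an illustrative variable, defaulting to the port used in this lesson.
tgi_generate() {
  local base_url="${TGI_URL:-http://localhost:8080}"
  curl -s "$base_url/generate" \
    -H 'Content-Type: application/json' \
    -d "$(generate_payload "$1" "${2:-32}")"
}
```

With a server running, tgi_generate 'What is tensor parallelism?' 64 should return a JSON object whose generated_text field holds the completion.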

Code

Run TGI with one command · bash

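# Single-GPU 8B model. Llama 3.1 is a gated repo, so pass a Hugging Face token
# that has been granted access (HF_TOKEN must already be set in your shell).
docker run --gpus all --shm-size 1g -p 8080:80 \
  -v "$PWD/data:/data" \
  -e HF_TOKEN="$HF_TOKEN" \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id meta-llama/Llama-3.1-8B-Instruct \
  --quantize bitsandbytes-nf4 \
  --max-concurrent-requests 64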

# In another terminal:
curl -s http://localhost:8080/info | python -m json.tool

TGI as an OpenAI-compatible endpoint · bash

# TGI exposes /v1/chat/completions out of the box.
curl http://localhost:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "tgi",
    "messages": [{"role":"user","content":"Hello"}],
    "max_tokens": 50
  }'
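
Adding "stream": true to the chat request body makes the server answer with server-sent events instead of a single JSON object. The sketch below strips the SSE framing so each line is one JSON chunk. The helper name sse_json_chunks is hypothetical, and the [DONE] sentinel follows the OpenAI convention; whether a given TGI version emits it is an assumption, so the loop also ends cleanly when the stream simply closes.

```shell
# Read an SSE stream on stdin and print only the JSON payload of each "data:" line.
# Accepts both "data: {...}" and "data:{...}" and stops at an OpenAI-style [DONE] sentinel.
sse_json_chunks() {
  local line chunk
  while IFS= read -r line; do
    case "$line" in
      "data:"*)
        chunk="${line#data:}"   # drop the field name
        chunk="${chunk# }"      # drop one optional leading space
        if [ "$chunk" = "[DONE]" ]; then
          break
        fi
        printf '%s\n' "$chunk"
        ;;
    esac
  done
}

# Usage (requires a running server):
# curl -sN http://localhost:8080/v1/chat/completions \
#   -H 'Content-Type: application/json' \
#   -d '{"model":"tgi","messages":[{"role":"user","content":"Hello"}],"max_tokens":50,"stream":true}' \
#   | sse_json_chunks
```

Each printed chunk carries a partial message, so a client can render tokens as they arrive instead of waiting for the full completion.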

Exercise

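Pull the latest TGI image. Run a 1-3B instruct model with --quantize bitsandbytes-nf4. Hit /info, /health, and /metrics. Send a chat completion through the OpenAI-compatible endpoint. Stop and restart the container, then verify that the weights are loaded from the cache on the volume instead of being downloaded again.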
