Health Checks, Metrics, and Restart Strategy

What to scrape

TGI and vLLM both expose Prometheus metrics at /metrics on a configurable port. Key series:

tgi_request_count / vllm:request_success_total: throughput.
tgi_request_inference_duration: per-request latency histogram.
tgi_queue_size / vllm:num_requests_waiting: backlog. Alert if it stays above 0.
tgi_batch_current_size: how full the live batch is.
GPU memory: scrape nvidia-smi-exporter or the DCGM exporter.
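
The "alert if the backlog stays above 0" rule can be prototyped before it becomes a real alert. The following is a minimal sketch, assuming the /metrics body is standard Prometheus text format. parse_samples and backlog_is_sustained are hypothetical helper names, and the three-sample window is an arbitrary illustrative choice, not a value recommended by TGI or vLLM.

```python
from typing import Dict, Iterable, List


def parse_samples(metrics_text: str, names: Iterable[str]) -> Dict[str, float]:
    """Sum the sample values of each requested series in Prometheus text format."""
    wanted = set(names)
    totals: Dict[str, float] = {}
    for raw in metrics_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):  # skip blank lines and HELP/TYPE comments
            continue
        if "{" in line:
            # "<name>{labels} <value> [timestamp]"; label values may contain spaces
            name = line[: line.index("{")]
            rest = line[line.rindex("}") + 1:]
        else:
            # "<name> <value> [timestamp]"
            name, _, rest = line.partition(" ")
        if name not in wanted:
            continue
        fields = rest.split()
        if not fields:
            continue
        try:
            value = float(fields[0])
        except ValueError:
            continue  # ignore malformed sample lines
        totals[name] = totals.get(name, 0.0) + value
    return totals


def backlog_is_sustained(queue_samples: List[float], min_consecutive: int = 3) -> bool:
    """Return True if the most recent `min_consecutive` samples are all above 0."""
    if len(queue_samples) < min_consecutive:
        return False
    return all(v > 0 for v in queue_samples[-min_consecutive:])
```

On each scrape, pass the response body to parse_samples, append the queue series' value to a list, and call backlog_is_sustained on that list. A single non-zero reading is normal under load; only a run of them suggests the server is falling behind.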

Health check

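Both serve GET /health, which returns 200 once the weights are loaded. Use it as the k8s readiness probe. To verify that the running model id matches what your config thinks is live, use GET /info (a TGI endpoint; on vLLM, GET /v1/models lists the served model).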
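
The two checks can be combined in a small script, for example as a deploy-time smoke test. This is a sketch using only the standard library, assuming the TGI /info response is a JSON object with a model_id field. The function names, the base URL, and the timeout are hypothetical.

```python
import json
import urllib.error
import urllib.request


def is_healthy(base_url: str, timeout: float = 5.0) -> bool:
    """Return True if GET <base_url>/health answers 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status all count as unhealthy.
        return False


def fetch_info(base_url: str, timeout: float = 5.0) -> dict:
    """Fetch and decode GET <base_url>/info (assumed to return a JSON object)."""
    with urllib.request.urlopen(f"{base_url}/info", timeout=timeout) as resp:
        return json.load(resp)


def model_matches(info: dict, expected_model_id: str) -> bool:
    """Compare the reported model id (assumed key: "model_id") with the expected one."""
    return info.get("model_id") == expected_model_id
```

A typical call sequence would be is_healthy("http://localhost:8080") followed by model_matches(fetch_info("http://localhost:8080"), "/models/llama-3.1-8b"), where the URL is whatever address the server is reachable on and the expected id is the value passed to --model-id.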

Restart strategy

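Inference servers occasionally leak memory under long-tailed workloads. On Docker, pin --restart=unless-stopped. On k8s, use restartPolicy: Always plus a liveness probe with a generous initial delay (loading a model can take minutes). Plan for cold starts: pre-load the weights into the image or onto a persistent volume, and never do a first pull on a production node.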

Code

k8s readiness + liveness probes (TGI / vLLM)·yaml

# Full pod spec elided.
spec:
  containers:
  - name: tgi
    image: ghcr.io/huggingface/text-generation-inference:latest
    args: ["--model-id", "/models/llama-3.1-8b", "--port", "80"]
    ports: [{containerPort: 80}]
    readinessProbe:
      httpGet: {path: /health, port: 80}
      initialDelaySeconds: 30
      periodSeconds: 10
    livenessProbe:
      httpGet: {path: /health, port: 80}
      initialDelaySeconds: 300   # generous: loading weights is slow
      periodSeconds: 30
      failureThreshold: 3

Quick check before firing an alert·bash

# Hit the metrics endpoint and grep for the key series.
# Assumes the server's port is published or port-forwarded to localhost:8080.
curl -s http://localhost:8080/metrics | grep -E 'request_count|queue_size|batch_current_size' | head -20
