Inference optimization: speculative decoding, PagedAttention, continuous batching

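Modern serving systems combine five main optimizations to squeeze usable throughput out of frontier models. Knowing what each one does is essential for choosing a stack and debugging performance.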

Flash Attention 2/3

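Already covered in Track 4. Attention is computed in tiles so that the (n × n) score matrix is never materialized in HBM, giving 2-4x the attention throughput of a naive implementation. FA3 adds warp specialization and FP8 support on H100, reaching ~740 TFLOPs/s in FP16 and close to 1.2 PFLOPs/s in FP8.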
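
As a reminder of why tiling works, here is a minimal pure-Python sketch of the running (online) softmax for a single query vector. It is not the Flash Attention kernel: the function names and the `tile` parameter are illustrative, and a real kernel does this per block of queries in on-chip SRAM. The point is only that the scores are consumed one tile at a time and the result still matches the naive computation.

```python
import math

def naive_attention(q, keys, values):
    """Reference: compute every score first, then softmax, then the weighted sum."""
    scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)]

def tiled_attention(q, keys, values, tile=2):
    """Same result, but only `tile` scores exist at any moment."""
    dim = len(values[0])
    running_max = float("-inf")   # largest score seen so far
    running_sum = 0.0             # softmax denominator, relative to running_max
    acc = [0.0] * dim             # unnormalized output, relative to running_max
    for start in range(0, len(keys), tile):
        k_tile = keys[start:start + tile]
        v_tile = values[start:start + tile]
        scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in k_tile]
        new_max = max(running_max, max(scores))
        # Rescale everything accumulated under the old maximum.
        scale = math.exp(running_max - new_max)
        running_sum *= scale
        acc = [a * scale for a in acc]
        for s, v in zip(scores, v_tile):
            w = math.exp(s - new_max)
            running_sum += w
            for d in range(dim):
                acc[d] += w * v[d]
        running_max = new_max
    return [a / running_sum for a in acc]
```

Any tile size gives the same output up to floating-point rounding, which is what lets the kernel pick a tile that fits in fast memory.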

GQA / MQA

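An architecture-level change. K and V are shared across groups of query heads, which shrinks the KV cache. Llama 3.3 (70B) uses 8 KV heads against 64 query heads, an 8x smaller cache than full multi-head attention. Longer contexts fit in less GPU memory.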
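
The 8x figure can be checked with back-of-the-envelope arithmetic. The sketch below assumes a 70B-class shape (80 layers, head dimension 128) and 2-byte FP16/BF16 cache entries; those numbers and the helper name are assumptions for illustration, not taken from the text above.

```python
def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Bytes of KV cache that one token occupies across all layers."""
    # Factor 2: each layer stores one K vector and one V vector per KV head.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value

# Assumed model shape (70B-class); adjust for the model you actually serve.
LAYERS, HEAD_DIM = 80, 128

mha_per_token = kv_cache_bytes_per_token(LAYERS, num_kv_heads=64, head_dim=HEAD_DIM)
gqa_per_token = kv_cache_bytes_per_token(LAYERS, num_kv_heads=8, head_dim=HEAD_DIM)

context = 8192
GIB = 2 ** 30
print(f"MHA: {mha_per_token * context / GIB:.1f} GiB per 8192-token sequence")
print(f"GQA: {gqa_per_token * context / GIB:.1f} GiB per 8192-token sequence")
print(f"reduction: {mha_per_token // gqa_per_token}x")
```

Under these assumptions a single 8192-token sequence needs 20 GiB of cache with 64 KV heads and 2.5 GiB with 8, which is the difference between one sequence and several fitting next to the weights.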

Speculative decoding

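A small draft model generates K candidate tokens, and the large target model verifies them in a single forward pass. The longest prefix that matches the target's own predictions is accepted (sampling-based variants use a rejection-sampling rule instead of exact match). The output distribution is unchanged, with a 2-3x wall-clock speedup. Implementations: llama.cpp, vLLM, MLC-LLM.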
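
The draft-then-verify loop can be sketched for the greedy case with toy models. Everything here is hypothetical: `target_next` and `draft_next` are stand-in functions over integer tokens, and the list comprehension that calls the target K+1 times stands in for what would be one batched forward pass in a real system.

```python
def target_next(seq):
    """Toy 'large' model: greedy next token as a deterministic function of context."""
    return (seq[-1] * 3 + len(seq)) % 7

def draft_next(seq):
    """Toy draft model: agrees with the target except when len(seq) is a multiple of 4."""
    guess = (seq[-1] * 3 + len(seq)) % 7
    return guess if len(seq) % 4 else (guess + 1) % 7

def greedy_decode(prompt, n_new):
    """Baseline: one target call per generated token."""
    seq = list(prompt)
    for _ in range(n_new):
        seq.append(target_next(seq))
    return seq

def speculative_decode(prompt, n_new, k=4):
    seq = list(prompt)
    goal = len(prompt) + n_new
    target_passes = 0
    while len(seq) < goal:
        # 1. The draft model proposes k tokens autoregressively (cheap).
        draft = []
        for _ in range(k):
            draft.append(draft_next(seq + draft))
        # 2. The target scores all k+1 positions (one forward pass in a real system).
        target_passes += 1
        verified = [target_next(seq + draft[:i]) for i in range(k + 1)]
        # 3. Accept the longest prefix where draft and target agree.
        n_accept = 0
        while n_accept < k and draft[n_accept] == verified[n_accept]:
            n_accept += 1
        # 4. Keep the accepted tokens plus the target's own token at the
        #    first mismatch (or a bonus token if everything matched).
        seq.extend(draft[:n_accept])
        seq.append(verified[n_accept])
    return seq[:goal], target_passes
```

Because step 4 always appends the target's own prediction, the output is identical to plain greedy decoding with the target; the saving comes from needing fewer target passes than generated tokens whenever the draft is often right.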

PagedAttention (vLLM)

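Manages the KV cache like an OS virtual-memory system: fixed-size pages (blocks), a block table mapping logical blocks to physical ones, and copy-on-write for shared prefixes. Cuts memory wasted to fragmentation from 60-80% to ~4%, which greatly increases the number of concurrent requests per GPU.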
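
The bookkeeping side of this can be sketched in a few lines. This is a toy model of the idea, not vLLM's code: `BlockAllocator` and `Sequence` are hypothetical names, and only block ids and reference counts are tracked (a real system would also copy the block's K/V tensors on a copy-on-write).

```python
class BlockAllocator:
    """Hands out fixed-size physical KV blocks and counts how many sequences use each."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.ref_count = {}  # physical block id -> number of sequences referencing it

    def allocate(self):
        if not self.free_blocks:
            raise MemoryError("out of KV cache blocks")
        block = self.free_blocks.pop()
        self.ref_count[block] = 1
        return block

    def share(self, block):
        self.ref_count[block] += 1

    def release(self, block):
        self.ref_count[block] -= 1
        if self.ref_count[block] == 0:
            del self.ref_count[block]
            self.free_blocks.append(block)


class Sequence:
    def __init__(self, allocator):
        self.allocator = allocator
        self.block_table = []  # index = logical block, value = physical block id
        self.num_tokens = 0

    def append_token(self):
        if self.num_tokens % self.allocator.block_size == 0:
            # No block yet, or the last one is full: grab one more block on demand.
            self.block_table.append(self.allocator.allocate())
        else:
            last = self.block_table[-1]
            if self.allocator.ref_count[last] > 1:
                # Copy-on-write: the partially filled block is shared, so this
                # sequence gets its own copy before writing into it.
                self.block_table[-1] = self.allocator.allocate()
                self.allocator.release(last)
        self.num_tokens += 1

    def fork(self):
        """New sequence that shares every existing block (e.g. a common prompt prefix)."""
        child = Sequence(self.allocator)
        child.block_table = list(self.block_table)
        child.num_tokens = self.num_tokens
        for block in self.block_table:
            self.allocator.share(block)
        return child

    def free(self):
        for block in self.block_table:
            self.allocator.release(block)
        self.block_table = []
        self.num_tokens = 0
```

Since blocks are allocated only when a sequence actually reaches them, the unused space per sequence is at most `block_size - 1` token slots, instead of a whole pre-reserved maximum-length buffer.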

Continuous batching

Iteration-level scheduling: at every decoding step, finished sequences are removed and the freed slots are filled with new requests. Cuts GPU idle time from ~40% to under 10%. Throughput improves 5-23x over static batching.
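
A toy simulation makes the scheduling difference concrete. It assumes every decoding step costs the same regardless of how many slots are busy and ignores prefill; the function names and the example workload are illustrative only.

```python
from collections import deque

def static_batching_steps(lengths, batch_size):
    """Each batch runs until its longest sequence finishes; short ones leave slots idle."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps

def continuous_batching_steps(lengths, batch_size):
    """Before every step, refill free slots from the queue; drop sequences as they finish."""
    queue = deque(lengths)
    running = []  # remaining tokens to generate for each active sequence
    steps = 0
    while queue or running:
        while queue and len(running) < batch_size:
            running.append(queue.popleft())
        steps += 1  # one decoding step produces one token per active sequence
        running = [r - 1 for r in running if r > 1]
    return steps

# Output lengths (in tokens) of 8 requests: mostly short, two long.
lengths = [2, 2, 2, 16, 2, 2, 2, 16]
static = static_batching_steps(lengths, batch_size=4)
continuous = continuous_batching_steps(lengths, batch_size=4)
print(f"static: {static} steps, continuous: {continuous} steps")
```

For this workload the static scheduler needs 32 steps and the continuous one 20, because the second long request starts while the first is still decoding instead of waiting for the whole first batch to drain.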

Technique	What it addresses	Gain
Flash Attention	Attention memory bandwidth	2-4x
GQA / MQA	KV cache size	2-8x smaller cache
Speculative decoding	Sequential decode bottleneck	2-3x wall-clock
PagedAttention	Memory fragmentation	~20x requests per GPU
Continuous batching	GPU idle time	5-23x throughput

Code

Serving with vLLM (PagedAttention + continuous batching)

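from vllm import LLM, SamplingParams

# Example prompts; replace with your own workload.
many_prompts = [
    "Explain PagedAttention in one sentence.",
    "What is continuous batching?",
]

llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct",
          gpu_memory_utilization=0.9,  # fraction of GPU memory vLLM may use
          max_model_len=8192)          # max context length (prompt + output)

# Speculative decoding (small draft model).
# The argument name depends on the vLLM version: older releases accept
# speculative_model=..., newer ones take a speculative_config dict.
# llm = LLM(model="...", speculative_model="meta-llama/Llama-3.2-1B")

params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=512)
outputs = llm.generate(many_prompts, params)
# vLLM batches them dynamically with continuous batching;
# PagedAttention manages KV cache pages.
# At scale this can be 10-20× faster than naive HuggingFace generate.

for output in outputs:
    print(output.outputs[0].text)  # first completion for each prompt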
