KV-Cache and Inference Cost

KV-cache makes long-context inference expensive

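Quadratic cost is the training-side story. The inference-side story is the KV-cache: during autoregressive generation, the key/value tensors of every token already processed are cached so they don't get recomputed for each new token. The cache grows linearly with context length, and at long context it dominates GPU memory.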
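To make the caching idea concrete, here is a minimal single-head sketch in plain Python. It is an illustration under assumptions, not how any production inference engine is written: the function names, the tiny dimensions, and the random weights are all made up for the example. It checks that decoding with a cache gives the same outputs as recomputing K and V for the whole prefix at every step, and that the cache holds exactly one entry per token.

```python
import math
import random

def project(x, W):
    # Vector-matrix product x @ W, with W stored as a list of rows.
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]

def attend(q, keys, values):
    # Scaled dot-product attention of one query over all keys/values seen so far.
    scale = math.sqrt(len(q))
    scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [sum(e / total * v[j] for e, v in zip(exps, values))
            for j in range(len(values[0]))]

def decode_with_cache(tokens, Wq, Wk, Wv):
    # Each token's K and V are computed once, appended, and reused afterwards.
    k_cache, v_cache, outputs = [], [], []
    for x in tokens:
        k_cache.append(project(x, Wk))
        v_cache.append(project(x, Wv))
        outputs.append(attend(project(x, Wq), k_cache, v_cache))
    return outputs, len(k_cache)

def decode_without_cache(tokens, Wq, Wk, Wv):
    # Reference path: recompute K and V for the whole prefix at every step.
    outputs = []
    for t in range(len(tokens)):
        prefix = tokens[: t + 1]
        keys = [project(x, Wk) for x in prefix]
        values = [project(x, Wv) for x in prefix]
        outputs.append(attend(project(prefix[-1], Wq), keys, values))
    return outputs

def rand_matrix(rows, cols):
    return [[random.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]

random.seed(0)
d, n_tokens = 8, 12          # toy sizes, chosen only for the demo
tokens = rand_matrix(n_tokens, d)
Wq, Wk, Wv = rand_matrix(d, d), rand_matrix(d, d), rand_matrix(d, d)

cached_out, cache_entries = decode_with_cache(tokens, Wq, Wk, Wv)
uncached_out = decode_without_cache(tokens, Wq, Wk, Wv)
same = all(abs(a - b) < 1e-9
           for row_a, row_b in zip(cached_out, uncached_out)
           for a, b in zip(row_a, row_b))
print(same, cache_entries)
```

The cached path does one K and one V projection per token; the uncached path redoes them for the entire prefix on every step. The price of skipping that work is the memory the cache occupies.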

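The numbers are brutal. Take a 70B-parameter model with 64 attention heads, head dim 128, and 80 layers, served in fp16. Each context token costs roughly 2 × n_layers × n_heads × head_dim × 2 bytes, where the leading 2 covers K and V and the trailing 2 is bytes per fp16 value. With all 64 heads caching K and V, that is about 2.5 MiB per token, or over 300 GB at 128K tokens. Even with only 8 KV heads (the GQA layout in the next section) it is about 40 GB just for cache, on top of the model weights. That is half of an 80 GB H100 for a single user session. Batching falls apart.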
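A rough way to see what this does to batching is to count how many sessions fit in whatever memory is left after the weights. The sketch below assumes a hypothetical 8 × 80 GB node and fp16 weights of 70e9 × 2 bytes, and it ignores activations, fragmentation, and framework overhead, so treat the counts as upper bounds. The function names are made up for the example.

```python
def kv_bytes_per_session(seq_len, n_layers, n_kv_heads, head_dim, dtype_bytes=2):
    # 2 (K and V) * layers * KV heads * head_dim * bytes per value, for every token.
    return seq_len * 2 * n_layers * n_kv_heads * head_dim * dtype_bytes

def max_sessions(cache_budget_bytes, seq_len, n_layers, n_kv_heads, head_dim):
    # Whole sessions that fit in the memory left over for cache.
    return cache_budget_bytes // kv_bytes_per_session(seq_len, n_layers, n_kv_heads, head_dim)

# Assumptions (hypothetical): 8 GPUs x 80 GB, minus 70e9 params x 2 bytes of fp16 weights.
# Activations, fragmentation, and framework overhead are ignored.
CACHE_BUDGET = 8 * 80 * 10**9 - 70 * 10**9 * 2

for n_kv_heads in (64, 8):
    for seq_len in (8_000, 128_000):
        fit = max_sessions(CACHE_BUDGET, seq_len, 80, n_kv_heads, 128)
        print(f"{n_kv_heads:>2} KV heads, {seq_len:>7} tokens -> {fit} concurrent sessions")
```

Under these assumptions, the 64-KV-head layout fits a single 128K session on the whole node, against 23 sessions at 8K.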

GQA and MQA: buying back KV memory

Grouped Query Attention (GQA, used by Llama 2 70B and Llama 3) shares K and V heads across groups of query heads. Llama 2 70B has 64 query heads and 8 KV heads: an 8× KV-cache reduction with almost no quality loss. Multi-Query Attention (MQA) goes all the way: a single KV head, an N× reduction for N query heads, and a measurable quality loss on hard tasks. Most modern frontier models use GQA by default.
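The head sharing can be sketched as a simple index mapping. This assumes consecutive query heads form a group, which is a common layout but an assumption here, and both function names are hypothetical. MHA is the special case where the number of KV heads equals the number of query heads; MQA is the case with one KV head.

```python
def kv_head_for_query_head(q_head, n_q_heads, n_kv_heads):
    # Assumed layout: consecutive query heads form a group sharing one KV head.
    group_size = n_q_heads // n_kv_heads
    return q_head // group_size

def cache_reduction(n_q_heads, n_kv_heads):
    # How many times smaller the KV-cache is than full MHA with n_q_heads KV heads.
    return n_q_heads // n_kv_heads

n_q = 64
for name, n_kv in [("MHA", 64), ("GQA", 8), ("MQA", 1)]:
    mapping = [kv_head_for_query_head(h, n_q, n_kv) for h in range(n_q)]
    print(f"{name}: {n_kv:>2} KV heads, {cache_reduction(n_q, n_kv):>2}x smaller cache, "
          f"query heads 0..9 read KV heads {mapping[:10]}")
```

Only the K and V heads are shared; every query head still computes its own attention pattern, it just reads from a shared set of keys and values.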

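Both still grow linearly with context. They shrink the bytes per token; they don't change the fact that every extra token adds more cache. That's why, even with GQA, serving a 128K-context model still costs roughly 3–5× more per request than serving 8K. That ratio shows up in long-context providers' API pricing tiers as a long-context tax.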
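A quick check of that point, using the 80-layer, head_dim-128, fp16 shape from above with 64, 8, and 1 KV heads standing in for MHA, GQA, and MQA. The function name is made up for the sketch, and it only measures cache memory, not total serving cost.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, dtype_bytes=2):
    # 2 (K and V) * layers * KV heads * head_dim * bytes per value, for every token.
    return seq_len * 2 * n_layers * n_kv_heads * head_dim * dtype_bytes

layouts = {"MHA": 64, "GQA": 8, "MQA": 1}
for name, n_kv_heads in layouts.items():
    per_token_kib = kv_cache_bytes(1, 80, n_kv_heads, 128) / 1024
    # Doubling the context doubles the cache, whatever the layout.
    growth = kv_cache_bytes(65_536, 80, n_kv_heads, 128) / kv_cache_bytes(32_768, 80, n_kv_heads, 128)
    print(f"{name}: {per_token_kib:>6.0f} KiB per token, 32K -> 64K grows the cache {growth:.1f}x")
```

The per-token cost drops from 2560 KiB to 320 KiB to 40 KiB, but the growth factor stays at 2.0 in every row.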

Code

KV-cache size (GB) for common model shapes·python

def kv_cache_gb(seq_len, n_layers, n_kv_heads, head_dim, dtype_bytes=2):
    # Per token: 2 (K and V) * layers * heads * head_dim * dtype_bytes
    bytes_per_tok = 2 * n_layers * n_kv_heads * head_dim * dtype_bytes
    return seq_len * bytes_per_tok / (1024 ** 3)

# Llama 3.1 70B uses GQA: 8 KV heads, 80 layers, head_dim 128
for n in [8_000, 32_000, 128_000, 1_000_000]:
    print(f'{n:>9} tok -> {kv_cache_gb(n, 80, 8, 128):>6.2f} GB KV-cache')
# Output (GB here means 1024**3 bytes):
# 8000    -> 2.44 GB
# 32000   -> 9.77 GB
# 128000  -> 39.06 GB
# 1000000 -> 305.18 GB

Exercise

Compute the per-token KV-cache size for three model families you might deploy: Llama 3.1 8B (GQA, 8 KV heads, 32 layers, head_dim 128), Llama 3.1 70B (8 KV heads, 80 layers, head_dim 128), and a hypothetical "MHA 70B" (64 KV heads, 80 layers, head_dim 128). For each, plot the GPU memory taken by the KV-cache alone at context lengths from 4K to 1M. Then mark the line past which it no longer fits on a single 80 GB H100. This is the plot every serving team pins to the wall.
