Speculative decoding and KV-cache tricks are not architectures

Speculative decoding: a serving optimization

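Speculative decoding uses a small "draft" model to quickly generate K candidate tokens, then has a larger "verifier" model check all K in parallel and accept the longest prefix that the verifier itself would have produced. With greedy decoding the output is token-for-token identical to standard decoding; with sampling and the standard rejection rule it is identical in distribution. It is a pure latency optimization that trades extra compute for fewer sequential calls to the large model.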
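The draft-then-verify loop described above can be sketched with toy models. This is a minimal greedy-only sketch, not a serving implementation: the `NextToken` interface, the two toy "models", and the vocabulary of 10 tokens are all made up for illustration, and the verifier is called in a Python loop where a real system would score all K positions in one batched forward pass.

```python
from typing import Callable, List

Token = int
NextToken = Callable[[List[Token]], Token]  # hypothetical interface: context -> greedy next token


def greedy_decode(model: NextToken, prompt: List[Token], n: int) -> List[Token]:
    """Plain autoregressive decoding: one model call per generated token."""
    out: List[Token] = []
    while len(out) < n:
        out.append(model(prompt + out))
    return out


def speculative_greedy(draft: NextToken, verifier: NextToken,
                       prompt: List[Token], n: int, k: int = 4) -> List[Token]:
    """Greedy speculative decoding: keep draft tokens while they match the verifier."""
    out: List[Token] = []
    while len(out) < n:
        # 1) The draft model proposes k tokens sequentially (cheap calls).
        proposal: List[Token] = []
        for _ in range(k):
            proposal.append(draft(prompt + out + proposal))
        # 2) The verifier checks each position. This loop only emulates what a
        #    real verifier would do in a single parallel pass.
        accepted: List[Token] = []
        for tok in proposal:
            target = verifier(prompt + out + accepted)
            accepted.append(target)   # we always keep the verifier's token
            if target != tok:         # first mismatch: discard the rest of the draft
                break
        out.extend(accepted)
    return out[:n]


# Toy "models" over a 10-token vocabulary (illustrative only).
def verifier(ctx: List[Token]) -> Token:
    return (sum(ctx) * 7 + len(ctx)) % 10


def draft(ctx: List[Token]) -> Token:
    # Agrees with the verifier except when the context length is a multiple of 3.
    return verifier(ctx) if len(ctx) % 3 else (verifier(ctx) + 1) % 10


prompt = [1, 2, 3]
# Same tokens as decoding with the verifier alone, whatever the draft proposes.
assert speculative_greedy(draft, verifier, prompt, n=12) == greedy_decode(verifier, prompt, n=12)
```

Every token that ends up in the output is the verifier's own choice, so a poor draft model costs speed but never changes the result.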

Why it is not a new architecture

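Speculative decoding changes neither model. It is a runtime trick that runs two models in lockstep, and both can be off-the-shelf (as long as they share a tokenizer). The "speculation" lives in the orchestration code that decides when to accept or reject draft tokens.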
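For sampled (non-greedy) decoding, the accept-or-reject decision is usually a rejection-sampling test. A minimal sketch of that test for a single position, assuming `p` is the verifier's next-token distribution and `q` is the draft's; the `Dist` type and the function names are hypothetical, chosen only for this example:

```python
import random
from typing import Dict, Tuple

Dist = Dict[str, float]  # token -> probability (hypothetical representation)


def residual(p: Dist, q: Dist) -> Dist:
    """Normalized max(0, p - q): the distribution to resample from after a rejection."""
    raw = {t: max(0.0, p[t] - q.get(t, 0.0)) for t in p}
    total = sum(raw.values())
    return {t: v / total for t, v in raw.items()}


def accept_or_resample(tok: str, p: Dist, q: Dist, rng: random.Random) -> Tuple[str, bool]:
    """Return (token, accepted) for a token `tok` that was sampled from the draft q."""
    # Accept the drafted token with probability min(1, p(tok) / q(tok)).
    if rng.random() < min(1.0, p.get(tok, 0.0) / q[tok]):
        return tok, True
    # Rejected: draw a replacement from the residual distribution instead.
    res = residual(p, q)
    tokens = list(res)
    return rng.choices(tokens, weights=[res[t] for t in tokens])[0], False
```

Nothing here touches model weights: the rule needs only the two probability vectors, which is why it sits in orchestration code. A rejection can only happen when `p(tok) < q(tok)`, so the residual always has positive mass when it is needed.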

KV-cache attention variants: GQA, MQA, MLA

These are real changes inside the attention layer, but they are refinements within the Transformer paradigm, not a new architecture family.

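MHA (Multi-Head Attention): the original. Each head has its own K and V projections.
MQA (Multi-Query Attention): a single K, V projection shared across all heads. The smallest KV cache of the head-sharing variants, and the most aggressive.
GQA (Grouped Query Attention): heads are grouped, and K, V are shared within each group. A compromise between MHA and MQA. Llama 3 and recent Gemma and Qwen releases use GQA.
MLA (Multi-head Latent Attention): DeepSeek's variant. K and V are compressed into a low-rank latent space and decompressed on demand. Only the compressed latent is cached, so the KV cache is much smaller than with MHA.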
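The effect of each variant on cache size is simple arithmetic. A rough sketch, assuming a hypothetical 32-layer model with head dimension 128 and fp16 values; the MLA line is a simplification that counts one cached latent vector per layer (the `latent_dim=512` value is illustrative) and ignores any extra positional component a real design may also cache:

```python
def kv_cache_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                             bytes_per_value: int = 2) -> int:
    """Bytes cached per token: K and V (2 tensors) for every KV head in every layer."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


def mla_cache_bytes_per_token(num_layers: int, latent_dim: int,
                              bytes_per_value: int = 2) -> int:
    """Simplified MLA: one shared compressed latent per layer instead of per-head K and V."""
    return num_layers * latent_dim * bytes_per_value


# Hypothetical model shape, not taken from any specific checkpoint.
LAYERS, HEAD_DIM = 32, 128

mha = kv_cache_bytes_per_token(LAYERS, num_kv_heads=32, head_dim=HEAD_DIM)
gqa = kv_cache_bytes_per_token(LAYERS, num_kv_heads=8, head_dim=HEAD_DIM)
mqa = kv_cache_bytes_per_token(LAYERS, num_kv_heads=1, head_dim=HEAD_DIM)
mla = mla_cache_bytes_per_token(LAYERS, latent_dim=512)

for name, size in [("MHA", mha), ("GQA", gqa), ("MQA", mqa), ("MLA", mla)]:
    print(f"{name}: {size / 1024:.0f} KiB per token")
```

With these numbers, GQA at 8 KV heads caches a quarter of what MHA does, and MQA caches 1/32. Multiply by context length and batch size to get the total footprint.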

Why they matter for serving but not for taxonomy

GQA, MQA, and MLA dramatically shrink the KV-cache memory footprint, which matters for long-context serving and multi-tenant inference. But the basic attention pattern (queries attend to keys, softmax similarity gives the weights, the weights are applied to values) is unchanged. These are engineering choices within the same architectural family.
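One way to see that the pattern is unchanged: a single attention routine can cover MHA, GQA, and MQA, with the number of KV heads as the only difference. A toy pure-Python sketch for one query position (function names and the tiny vectors are illustrative; MLA is left out because it adds a decompression step before the same computation):

```python
import math
from typing import List

Vec = List[float]


def softmax(xs: Vec) -> Vec:
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(q: Vec, keys: List[Vec], values: List[Vec]) -> Vec:
    """One query over a cached sequence: softmax(q.k / sqrt(d)) weighted sum of values."""
    d = len(q)
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
    weights = softmax(scores)
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(len(values[0]))]


def multi_head_attend(queries: List[Vec], kv_keys: List[List[Vec]],
                      kv_values: List[List[Vec]]) -> List[Vec]:
    """queries: one vector per query head. kv_keys / kv_values: one cached sequence per KV head."""
    group_size = len(queries) // len(kv_keys)  # query heads sharing each KV head
    return [attend(q, kv_keys[h // group_size], kv_values[h // group_size])
            for h, q in enumerate(queries)]


seq = [[1.0, 0.0], [0.0, 1.0]]                               # 2 cached positions, dim 2
queries = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]  # 4 query heads
for num_kv_heads in (4, 2, 1):                               # MHA, GQA, MQA
    out = multi_head_attend(queries, [seq] * num_kv_heads, [seq] * num_kv_heads)
    assert len(out) == len(queries)                          # one output per query head
```

Only the indexing `h // group_size` changes between variants; `attend` itself is the same softmax-weighted sum in every case.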

Reading rule

When a paper introduces a new attention variant, ask: "Does it still keep the basic Q×K→softmax×V pattern?" If yes, it is an attention refinement, not a new architecture family. If no, you are looking at a genuinely non-attention design (Mamba, RWKV; covered in the Frontier track).

Code

Speculative decoding skeleton · python

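def speculative_decode(prompt, draft_model, verifier_model, k=4):
    # Skeleton only: done(), accept(), and the model methods are placeholders
    # for whatever the serving stack provides.
    output = []
    while not done(output):
        context = prompt + output
        # Draft model proposes k tokens (cheap sequential calls).
        draft_tokens = draft_model.generate(context, max_new=k)
        # Verifier scores all k positions in one parallel forward pass.
        verifier_logprobs = verifier_model.score(context, draft_tokens)
        accepted = []
        for tok, lp in zip(draft_tokens, verifier_logprobs):
            if accept(tok, lp):
                accepted.append(tok)
            else:
                # First rejection: take the verifier's token, drop the rest of the draft.
                accepted.append(verifier_model.sample_one(context + accepted))
                break
        output.extend(accepted)
    return output
# With greedy matching, or with the standard rejection rule plus residual
# resampling inside accept()/sample_one(), the output matches
# verifier_model.generate alone (exactly, or in distribution).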

MHA vs GQA vs MQA: number of KV heads · python

# MHA: num_kv_heads == num_q_heads (e.g., 32 query heads, 32 KV heads)
# GQA: num_kv_heads <  num_q_heads (e.g., 32 query heads,  8 KV heads in Llama 3 8B)
# MQA: num_kv_heads == 1            (e.g., 32 query heads,  1 KV head)
