Head counts by model: from 8 to 96

~8 min · heads, model-survey

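The number of heads varies from model to model, depending on d_model and design philosophy. The pattern: d_head (d_model divided by the number of Q heads) settles at 64 or 128, while d_model and the head count grow together.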

Model	d_model	Q heads	KV heads	d_head
Transformer Base (2017)	512	8	8	64
BERT-base	768	12	12	64
GPT-2 (small)	768	12	12	64
GPT-3 (175B)	12,288	96	96	128
Llama 3 (8B)	4,096	32	8 (GQA)	128
Llama 3.3 (70B)	8,192	64	8 (GQA)	128
Mixtral 8×22B	6,144	48	8 (GQA)	128
Qwen 2.5-7B	3,584	28	4 (GQA)	128
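
The pattern in the table can be checked with a few lines of arithmetic. The sketch below hard-codes the rows above and, for each one, derives d_head as d_model / Q heads along with the number of Q heads that share one KV head. The names `ROWS` and `derived` are illustrative only and do not come from any library.

```python
# Rows copied from the table above: (model, d_model, q_heads, kv_heads, d_head)
ROWS = [
    ("Transformer Base (2017)", 512, 8, 8, 64),
    ("BERT-base", 768, 12, 12, 64),
    ("GPT-2", 768, 12, 12, 64),
    ("GPT-3", 12288, 96, 96, 128),
    ("Llama 3 (8B)", 4096, 32, 8, 128),
    ("Llama 3.3 (70B)", 8192, 64, 8, 128),
    ("Mixtral 8x22B", 6144, 48, 8, 128),
    ("Qwen 2.5-7B", 3584, 28, 4, 128),
]


def derived(d_model, q_heads, kv_heads):
    """Return (d_head, q_heads_per_kv_head) for one table row."""
    assert d_model % q_heads == 0, "d_model must split evenly across Q heads"
    assert q_heads % kv_heads == 0, "Q heads must split evenly across KV heads"
    return d_model // q_heads, q_heads // kv_heads


for model, d_model, q_heads, kv_heads, d_head in ROWS:
    got_d_head, group = derived(d_model, q_heads, kv_heads)
    assert got_d_head == d_head  # the table's d_head column is d_model / Q heads
    print(f"{model:<24} d_head={got_d_head:<4} Q heads per KV head={group}")
```

A group size of 1 is plain multi-head attention; anything larger means several Q heads read from the same K and V projections.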

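The trend is clear: modern models use d_head = 128 and keep the number of KV heads small with GQA. Decoupling Q heads (representational capacity) from KV heads (cache memory) is one of the most influential design decisions of the past three years.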
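
To make the cache-memory side concrete, here is a rough sketch of the per-token KV cache size for a Llama-3-8B-like shape, with GQA (8 KV heads) and with a hypothetical MHA variant (32 KV heads). The layer count of 32 and the 16-bit cache are assumptions that are not in the table, and the helper name `kv_cache_bytes_per_token` is made up for this example. Batch size and allocator overhead are ignored.

```python
def kv_cache_bytes_per_token(n_layers, n_kv_heads, d_head, bytes_per_elem=2):
    """Bytes one token occupies in the KV cache.

    The leading 2 counts the K and V tensors; each is stored for every
    layer and every KV head. bytes_per_elem=2 assumes a 16-bit cache.
    """
    return 2 * n_layers * n_kv_heads * d_head * bytes_per_elem


N_LAYERS = 32          # assumed layer count, not taken from the table
CONTEXT = 128 * 1024   # 128K tokens

gqa = kv_cache_bytes_per_token(N_LAYERS, n_kv_heads=8, d_head=128)
mha = kv_cache_bytes_per_token(N_LAYERS, n_kv_heads=32, d_head=128)

for label, per_token in [("GQA, 8 KV heads", gqa), ("MHA, 32 KV heads", mha)]:
    total_gib = per_token * CONTEXT / 2**30
    print(f"{label}: {per_token // 1024} KiB per token, "
          f"{total_gib:.0f} GiB for one 128K-token sequence")
```

Under these assumptions the Q heads are untouched, yet the cache shrinks by the ratio of Q heads to KV heads (4× here).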

Code

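Read it from a model config · python

from transformers import AutoConfig

# Note: the meta-llama repos are gated, so fetching their configs requires
# accepting the license and authenticating with Hugging Face.
for name in ["meta-llama/Meta-Llama-3-8B",
             "meta-llama/Meta-Llama-3.1-70B",
             "mistralai/Mixtral-8x22B-v0.1"]:
    cfg = AutoConfig.from_pretrained(name)  # fetches the config, not the weights
    n_heads = cfg.num_attention_heads
    # Configs without GQA may lack num_key_value_heads; fall back to the Q head count.
    n_kv_heads = getattr(cfg, "num_key_value_heads", n_heads)
    # Some configs set head_dim explicitly; otherwise derive it from d_model.
    d_head = getattr(cfg, "head_dim", None) or cfg.hidden_size // n_heads
    print(f"{name}")
    print(f"  d_model       = {cfg.hidden_size}")
    print(f"  Q heads       = {n_heads}")
    print(f"  KV heads      = {n_kv_heads}")
    print(f"  d_head        = {d_head}")
    print()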

Exercise

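Pull the configs of five open-weight models you would plausibly serve in production and tabulate (d_model, n_heads, n_kv_heads, d_head). At 128K context, which model has the smallest KV cache per token? Which has the largest? (Hint: KV-cache size is proportional to n_kv_heads × d_head × n_layers.)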
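
One possible scaffold for the comparison step, using the proportionality from the hint. The `AttnShape` tuple, the `kv_score` and `rank_by_kv_cache` helpers, and the three entries are all placeholders invented for illustration; the entries are not real models and should be replaced with values read from actual configs.

```python
from typing import NamedTuple


class AttnShape(NamedTuple):
    name: str
    n_kv_heads: int
    d_head: int
    n_layers: int


def kv_score(m: AttnShape) -> int:
    """Relative per-token KV cache size: n_kv_heads * d_head * n_layers."""
    return m.n_kv_heads * m.d_head * m.n_layers


def rank_by_kv_cache(models):
    """Sort models from smallest to largest per-token KV cache."""
    return sorted(models, key=kv_score)


# Placeholder shapes only; fill in numbers from the configs you collected.
models = [
    AttnShape("model-a", n_kv_heads=8, d_head=128, n_layers=32),
    AttnShape("model-b", n_kv_heads=4, d_head=128, n_layers=28),
    AttnShape("model-c", n_kv_heads=8, d_head=128, n_layers=80),
]

ranked = rank_by_kv_cache(models)
print("smallest:", ranked[0].name, kv_score(ranked[0]))
print("largest: ", ranked[-1].name, kv_score(ranked[-1]))
```

The score is only a relative measure; the comparison assumes every model stores its cache at the same precision.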
