Embedding dimension (d_model): from 768 to 12,288

~10 min · d-model, scaling, capacity

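d_model is the most important hyperparameter you'll meet in this quest. It is the width of every internal representation: embedding rows, attention output, FFN output, and the residual stream that runs through every layer. A larger d_model gives each token more capacity, but it is expensive. Attention parameters grow quadratically in d_model, and so do FFN parameters whenever the FFN hidden width is a fixed multiple of d_model (the usual 4x). FFN cost is linear in d_model only if that hidden width is held constant.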
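To make "every internal representation has width d_model" concrete, here is a minimal sketch that just lists the tensor shapes in one block. The helper name `block_shapes`, the sequence length, and the head count are illustrative assumptions, and the FFN hidden width is assumed to be 4 x d_model.

```python
def block_shapes(seq_len, vocab, d_model, n_heads, d_ff_mul=4):
    """Return the shapes of the main tensors in one Transformer block."""
    assert d_model % n_heads == 0, "d_model must split evenly across heads"
    d_head = d_model // n_heads      # each head works in a d_model / n_heads slice
    d_ff = d_ff_mul * d_model        # assumed FFN hidden width
    return {
        "embedding table":  (vocab, d_model),
        "residual stream":  (seq_len, d_model),
        "per-head Q/K/V":   (n_heads, seq_len, d_head),
        "attention output": (seq_len, d_model),
        "FFN hidden":       (seq_len, d_ff),
        "FFN output":       (seq_len, d_model),
    }

# Illustrative sizes: d_model=768 with 12 heads, 16 tokens in the sequence.
shapes = block_shapes(seq_len=16, vocab=50_257, d_model=768, n_heads=12)
for name, shape in shapes.items():
    print(f"{name:18s} {shape}")
```

Everything that gets added back into the residual stream ends in d_model. Only the FFN hidden layer and the per-head slices use a different width.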

Model	d_model	Layers	Total parameters
BERT-base	768	12	110M
GPT-2	768	12	117M
GPT-2-XL	1,600	48	1.5B
Llama 3 (8B)	4,096	32	8B
Llama 3.3 (70B)	8,192	80	70B
GPT-3	12,288	96	175B
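A quick sanity check on the table, as a rough sketch: if each layer has four d x d attention matrices and two FFN matrices with hidden width 4 x d, the non-embedding parameter count is about 12 x layers x d_model^2. The function name `non_embedding_estimate` is made up for this example, and the rule ignores embeddings, biases, norms, grouped-query attention, and gated FFNs, so it will not match every model.

```python
# (model, d_model, n_layers, reported total) copied from the table above
rows = [
    ("BERT-base",          768, 12, 110e6),
    ("GPT-2",              768, 12, 117e6),
    ("GPT-2-XL",         1_600, 48, 1.5e9),
    ("Llama 3 (8B)",     4_096, 32, 8e9),
    ("Llama 3.3 (70B)",  8_192, 80, 70e9),
    ("GPT-3",           12_288, 96, 175e9),
]

def non_embedding_estimate(d_model, n_layers):
    # 4*d^2 for Q, K, V, O plus 8*d^2 for two FFN matrices of hidden width 4*d
    return 12 * n_layers * d_model ** 2

estimates = {name: non_embedding_estimate(d, n) for name, d, n, _ in rows}
for name, d, n, reported in rows:
    est = estimates[name]
    print(f"{name:16s} est {est/1e9:7.2f}B  reported {reported/1e9:7.2f}B  "
          f"ratio {est/reported:.2f}")
```

The estimate lands within a few percent for GPT-2-XL and GPT-3. It undershoots the smaller models, where the embedding table it leaves out is a larger share of the total.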

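Why a larger d_model helps: more dimensions leave room to encode more distinct aspects of meaning, such as grammatical role, sentiment, formality, topic, named-entity type, and many others we have no names for. Interpretability research with sparse autoencoders suggests that representations in modern models (d_model of 4096 and up) hold tens of thousands of distinct "features" packed in superposition, that is, as overlapping directions rather than one feature per dimension.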
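The geometric intuition behind superposition can be sketched with random vectors: a space with `dim` dimensions has room for many more than `dim` directions that are nearly orthogonal to each other. This is only a toy illustration with made-up sizes. It is not how a trained model or a sparse autoencoder finds its features.

```python
import math
import random

def random_unit_vector(dim, rng):
    """Sample a direction uniformly at random by normalizing a Gaussian vector."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def mean_abs_cosine(vectors):
    """Average |cosine similarity| over all pairs of unit vectors."""
    total, count = 0.0, 0
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            total += abs(sum(a * b for a, b in zip(vectors[i], vectors[j])))
            count += 1
    return total / count

rng = random.Random(0)
dim, n_features = 64, 256   # toy sizes: 4x more directions than dimensions
features = [random_unit_vector(dim, rng) for _ in range(n_features)]
overlap = mean_abs_cosine(features)
print(f"{n_features} directions in {dim} dims, mean |cos| = {overlap:.3f}")
```

The typical overlap shrinks roughly like 1 / sqrt(dim), so a wider residual stream can hold more feature directions before they interfere badly.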

Code

Capacity scaling: what doubling d_model costs · python

def transformer_param_estimate(vocab, d, n_layers, d_ff_mul=4):
    embed = vocab * d                     # input embedding
    attn  = 4 * d * d * n_layers           # Q, K, V, O per layer
    ffn   = 2 * d * (d_ff_mul * d) * n_layers   # 2 matrices per layer
    return embed + attn + ffn

# Doubling d_model from 4096 -> 8192 with 32 layers, vocab 128K:
small = transformer_param_estimate(128_000, 4_096, 32)
big   = transformer_param_estimate(128_000, 8_192, 32)
print(f"d=4096:  {small/1e9:5.2f}B params")
print(f"d=8192:  {big/1e9:5.2f}B params  (~{big/small:.1f}x)")
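# Note: attention and FFN both scale as d^2 (FFN because d_ff = d_ff_mul * d).
# Only the embedding term is linear in d, so the total grows ~3.8x here, not 4x.
# The estimate ignores biases, LayerNorm weights, and any untied output head.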

Exercise

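Pull the architecture settings from the config.json of three open-weight models (for example Mistral 7B, Llama 3 8B, Qwen 2.5-7B). Put d_model, n_layers, n_heads, vocab_size, and intermediate_size in a table. In Hugging Face configs for these models, d_model appears as hidden_size, n_layers as num_hidden_layers, and n_heads as num_attention_heads. Where do the models differ? What design decisions can you read from the numbers?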
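One possible starting point for the exercise, as a sketch: a small helper that pulls the five fields out of a config.json and adds the FFN-to-d_model ratio. The key names are the ones used in Llama-style Hugging Face configs, and other model families may name them differently. The helper name `summarize_config` and the toy config are made up for illustration. Swap in the real files you download.

```python
import json

# Map the lesson's names to Llama-style Hugging Face config keys (assumption:
# the config you load uses these key names).
FIELDS = {
    "d_model": "hidden_size",
    "n_layers": "num_hidden_layers",
    "n_heads": "num_attention_heads",
    "vocab_size": "vocab_size",
    "intermediate_size": "intermediate_size",
}

def summarize_config(config_text):
    """Extract the architecture fields from the text of a config.json file."""
    config = json.loads(config_text)
    return {label: config.get(key) for label, key in FIELDS.items()}

# Toy config with made-up values, NOT a real model.
toy_config = json.dumps({
    "hidden_size": 512,
    "num_hidden_layers": 8,
    "num_attention_heads": 8,
    "vocab_size": 32_000,
    "intermediate_size": 1_408,
})

row = summarize_config(toy_config)
row["ffn_ratio"] = row["intermediate_size"] / row["d_model"]
print(row)
```

The ffn_ratio column is worth watching: it shows at a glance whether a model follows the classic 4x FFN width or some other choice.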
