왜 √d_k로 나누나? 스케일링 인자의 정체

~10 min · scaled-dot-product, softmax

Level 0Token

0 XP0/94 lessons0/10 achievements

0/120 XP to next level120 XP to go0% complete

attention 공식의 1/√d_k는 장식이 아니야. 그게 없으면 큰 d_k가 softmax를 죽여.

이유는 이거. 차원 d_k인 두 랜덤 벡터의 dot product는 분산이 d_k에 비례해. 그래서 Q와 K가 기댓값으로 대략 단위 norm이면 Q · K의 magnitude가 √d_k 정도. d_k = 64면 약 8, d_k = 128이면 약 11. magnitude 8짜리 값들을 softmax에 넣으면 가장 큰 거 하나가 거의 완승 — softmax saturate, 다른 항목들이 거의 0, 거기 통과하는 gradient가 죽어.

√d_k로 나누면 점수 분산이 차원 상관없이 O(1) 유지. softmax가 Q, K 변화에 부드럽게 반응하는 유용한 분포를 생산하고, gradient가 흘러.

모던 Transformer에서 단순 상수 보정 하나가 "학습됨" vs "학습 안 됨"의 차이를 만드는 몇 안 되는 곳이야. 아키텍처가 scaled dot-product attention이라 불리는 이유이기도 하고.

Code

See the saturation yourself·python

import torch

d_k = 128
q = torch.randn(d_k)
k = torch.randn(50, d_k)              # 50 candidate keys

raw = k @ q                            # ~N(0, sqrt(d_k))
scaled = raw / d_k**0.5

print('raw weights (top 5):',
      torch.softmax(raw, 0).topk(5).values.tolist())
print('scaled weights (top 5):',
      torch.softmax(scaled, 0).topk(5).values.tolist())
# The unscaled top weight is often > 0.9 (saturated);
# the scaled top weight is typically 0.05-0.2 (useful).

External links

Vaswani et al. — sec. 3.2.1 explains the scaling

Exercise

작은 Transformer(2 layer, d_model=64)를 √d_k 스케일링 있는 거랑 없는 거 둘 다 copy task로 학습. loss 곡선 같은 축에 플롯. 스케일링 없는 버전은 더 일찍 plateau에 빠지거나 아예 학습 실패해야 해. 얼마나 더 일찍?

Progress

Progress is local-only — sign in to sync across devices.

← PreviousScaled dot-product attention — Q · K^T → softmax → · V Next →Multi-head attention — 병렬 subspace들

이 페이지에서 버그를 발견하셨거나 피드백이 있으세요?문제 신고

🔔 답글 알림 (로그인 필요)

로그인 — 댓글을 남기려면 로그인해 주세요.

아직 댓글이 없어요. 첫 댓글을 남겨보세요.