The 2017 paper: 'Attention Is All You Need'

In June 2017, eight researchers (Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez, Kaiser, Polosukhin), most of them at Google Brain and Google Research, posted Attention Is All You Need to arXiv. The title is the thesis: no recurrence, no convolution, just self-attention plus the right scaffolding.

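Four ideas from the paper are now standard vocabulary in modern AI:

Scaled dot-product attention. Scores are divided by √d_k before the softmax. With large d_k the raw dot products grow in magnitude and push the softmax into saturation, where gradients are tiny; the 1/√d_k factor is the trick that keeps that in check.
Multi-head attention. Run attention h times in parallel, each in its own lower-dimensional subspace (d_k = d_model / h), then concatenate the results and project them back. Each head can learn a different relational pattern.
Positional encoding. Self-attention by itself treats the input as an unordered set, so word order has to come from somewhere. Sinusoids added to the input embeddings inject position, and the attention layers themselves stay untouched.
Residual block structure. A self-attention sub-layer and a feed-forward sub-layer, each wrapped in a residual connection plus layer norm (decoder layers add a third sub-layer that attends over the encoder output). This is the unit cell of every modern transformer.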
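To make the first and third ideas concrete, here's a minimal dependency-free sketch of the paper's two formulas: Attention(Q, K, V) = softmax(QKᵀ/√d_k)V and the sinusoidal encoding. It's single-head only, with no learned projections, and the function names and the `scale` flag are made up for illustration. In a real model you'd use PyTorch tensors instead of nested lists.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(Q, K, V, scale=True):
    """softmax(Q K^T / sqrt(d_k)) V on lists of row vectors.

    `scale=False` is a hypothetical switch, only here to show what the
    1/sqrt(d_k) factor buys you.
    """
    d_k = len(K[0])
    factor = 1.0 / math.sqrt(d_k) if scale else 1.0
    outputs, weights = [], []
    for q in Q:
        scores = [factor * sum(qi * ki for qi, ki in zip(q, k)) for k in K]
        w = softmax(scores)
        weights.append(w)
        # Each output row is the attention-weighted sum of the value rows.
        outputs.append(
            [sum(wj * v[c] for wj, v in zip(w, V)) for c in range(len(V[0]))]
        )
    return outputs, weights


def sinusoidal_positional_encoding(max_len, d_model):
    """PE[pos][2i] = sin(pos / 10000^(2i/d_model)); PE[pos][2i+1] = cos(same angle)."""
    pe = [[0.0] * d_model for _ in range(max_len)]
    for pos in range(max_len):
        for i in range(0, d_model, 2):  # i plays the role of "2i" in the paper
            angle = pos / (10000 ** (i / d_model))
            pe[pos][i] = math.sin(angle)
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)
    return pe


random.seed(0)
seq_len, d_k = 10, 64
Q = [[random.gauss(0, 1) for _ in range(d_k)] for _ in range(seq_len)]
K = [[random.gauss(0, 1) for _ in range(d_k)] for _ in range(seq_len)]
V = [[random.gauss(0, 1) for _ in range(d_k)] for _ in range(seq_len)]

out, w_scaled = scaled_dot_product_attention(Q, K, V, scale=True)
_, w_raw = scaled_dot_product_attention(Q, K, V, scale=False)

# Average of the largest attention weight per query: closer to 1.0 means
# the softmax is closer to one-hot, i.e. more saturated.
mean_peak_scaled = sum(max(row) for row in w_scaled) / seq_len
mean_peak_raw = sum(max(row) for row in w_raw) / seq_len
print(f"mean peak weight, scaled:   {mean_peak_scaled:.3f}")
print(f"mean peak weight, unscaled: {mean_peak_raw:.3f}")

pe = sinusoidal_positional_encoding(max_len=seq_len, d_model=512)
print(len(out), len(out[0]), len(pe), len(pe[0]))
```

The unscaled peak weight can never be lower than the scaled one, since dropping the 1/√d_k factor is the same as sharpening the softmax temperature by √d_k. That's the saturation the scaling is there to prevent.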

Receipts

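The original encoder-decoder Transformer (the big model) scored 28.4 BLEU on WMT14 EN-DE and 41.8 on EN-FR. That was state of the art at the time (single-model, in the EN-FR case), reached at a fraction of the training cost of comparable models. Base model: 6 encoder + 6 decoder layers, d_model=512, h=8, d_ff=2048, about 65M parameters. Big model: about 213M. Put it next to LLaMA 3.3 (70B, 80 layers, d_model=8192) and it looks cute, but the unit cell is the same.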
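You can sanity-check those parameter counts with back-of-the-envelope arithmetic. The sketch below rests on assumptions the numbers above don't spell out: every linear layer has a bias, there's a single embedding matrix shared between both embeddings and the output projection, the vocabulary is 37,000 tokens, and the sinusoidal positional encoding has no learned parameters. The function names are hypothetical.

```python
def attention_params(d_model):
    # Four d_model x d_model projections (W_Q, W_K, W_V, W_O), each with a bias (assumption).
    return 4 * (d_model * d_model + d_model)


def ffn_params(d_model, d_ff):
    # Two linear layers: d_model -> d_ff -> d_model, with biases.
    return d_model * d_ff + d_ff + d_ff * d_model + d_model


def layer_norm_params(d_model):
    # One gain and one bias vector.
    return 2 * d_model


def transformer_params(d_model, d_ff, n_layers=6, vocab_size=37000):
    # Encoder layer: self-attention + FFN, one layer norm per sub-layer.
    enc_layer = (attention_params(d_model) + ffn_params(d_model, d_ff)
                 + 2 * layer_norm_params(d_model))
    # Decoder layer: self-attention + encoder-decoder attention + FFN.
    dec_layer = (2 * attention_params(d_model) + ffn_params(d_model, d_ff)
                 + 3 * layer_norm_params(d_model))
    # One shared embedding matrix (assumption).
    embedding = vocab_size * d_model
    return embedding + n_layers * (enc_layer + dec_layer)


base_total = transformer_params(d_model=512, d_ff=2048)
big_total = transformer_params(d_model=1024, d_ff=4096)
print(f"base ~ {base_total / 1e6:.1f}M, big ~ {big_total / 1e6:.1f}M")
# prints: base ~ 63.1M, big ~ 214.2M
```

Under these assumptions you land within a couple of million of the reported 65M and 213M. The leftover gap comes down to choices like exact vocabulary size and which layers carry biases, so treat this as a plausibility check, not a reproduction.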

Code

Original Transformer hyperparameters·python

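# Attention Is All You Need: base model
config_base = {
    'd_model': 512,
    'd_ff': 2048,        # 4 × d_model
    'n_heads': 8,
    'd_k': 64,           # d_model / n_heads (d_v is the same)
    'n_layers': 6,       # each of encoder and decoder
    'dropout': 0.1,
    'vocab_size': 37000, # approx. shared BPE vocabulary, WMT14 EN-DE
    'max_len': 5000,     # not from the paper; a common implementation default for the PE table
}

# big model: d_k stays 64 because 1024 / 16 == 64
config_big = {**config_base,
    'd_model': 1024, 'd_ff': 4096, 'n_heads': 16,
    'dropout': 0.3,      # EN-DE setting; the paper's EN-FR big model used 0.1
}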
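One thing to watch in the dict version: d_k is stored by hand, so config_big only gets the right value because 1024 / 16 happens to equal 64 too. Here's a sketch of an alternative that derives d_k instead of storing it. `TransformerConfig` is a hypothetical name for illustration, not something from the paper or any library.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class TransformerConfig:  # hypothetical name, for illustration only
    d_model: int = 512
    d_ff: int = 2048
    n_heads: int = 8
    n_layers: int = 6
    dropout: float = 0.1

    def __post_init__(self):
        # Heads must tile d_model exactly, or the concat won't restore d_model.
        if self.d_model % self.n_heads != 0:
            raise ValueError("d_model must be divisible by n_heads")

    @property
    def d_k(self) -> int:
        # Derived, so it can't drift out of sync when d_model or n_heads change.
        return self.d_model // self.n_heads


base = TransformerConfig()
big = replace(base, d_model=1024, d_ff=4096, n_heads=16, dropout=0.3)
print(base.d_k, big.d_k)  # prints: 64 64
```

`dataclasses.replace` builds a new instance through `__init__`, so the divisibility check runs again for the big config.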

Exercise

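Read the original paper start to finish with a notebook open. For each of the four innovations (scaled dot-product attention, multi-head attention, positional encoding, sub-layer residual), write one PyTorch cell that follows the paper's equations exactly and verify the output shape for d_model=512, n_heads=8, seq_len=10. Hang on to this notebook; you'll use it as a reference card for the whole quest.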
