Attention Breakthrough

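Attention in one sentence

Attention computes a weighted average of value vectors, where the weights come from comparing a query vector against a set of key vectors. Each output position "attends" to the inputs that matter most to it, and the projections that produce those weights are learned end-to-end through backprop.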
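
As a minimal sketch of that weighted average, here is one query attending over three key/value pairs in plain Python. The toy vectors and the helper name `attend` are made up for illustration, and the scores are raw dot products with no scaling or learned projections.

```python
import math

def attend(query, keys, values):
    """Return (weights, output) for one query over a set of key/value pairs."""
    # Compare the query with every key: one dot-product score per key.
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    # Softmax turns the scores into weights that are positive and sum to 1.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # The output is the weighted average of the value vectors.
    dim = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, output

# Hypothetical toy data: the query is most similar to the second key.
query = [1.0, 0.0]
keys = [[0.0, 1.0], [1.0, 0.0], [-1.0, 0.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]

weights, output = attend(query, keys, values)
print(weights)  # the second weight is the largest
print(output)   # pulled toward the second value vector
```

The key that best matches the query gets the largest weight, so the output leans toward that key's value vector.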

Self-attention is the special case where the queries, keys, and values all come from the same sequence. Every token can now look directly at every other token in a single step: no recurrence, no fixed-size hidden state. Long-range dependencies become trivially expressible.

Tip: If you read only one paper in this entire quest, make it "Attention Is All You Need" (Vaswani et al., 2017). In roughly ten pages of main text it describes the architecture that powers nearly every modern LLM.

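The math of one attention head

Given Q, K, V (each of shape [B, T, d_k]): attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V, where K^T transposes the last two dimensions. Dividing by sqrt(d_k) keeps the dot products from growing with the dimension: for components with unit variance, a dot product over d_k terms has variance d_k, and large scores push softmax into regions with very small gradients. Softmax turns each row of dot products into a probability distribution over the input tokens. The result is [B, T, d_k], the same shape as the input.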
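
To make the role of sqrt(d_k) concrete, the sketch below measures the spread of dot products between random vectors at a few sizes. It assumes components drawn from a standard normal distribution; the helper name `score_std` and the sample count are illustrative choices.

```python
import math
import torch

torch.manual_seed(0)

def score_std(d_k, scaled, n=4096):
    """Empirical std of q.k for random unit-variance vectors of size d_k."""
    q = torch.randn(n, d_k)
    k = torch.randn(n, d_k)
    scores = (q * k).sum(dim=-1)          # n independent dot products
    if scaled:
        scores = scores / math.sqrt(d_k)  # the attention scaling factor
    return scores.std().item()

for d_k in (4, 64, 1024):
    raw = score_std(d_k, scaled=False)    # grows roughly like sqrt(d_k)
    fixed = score_std(d_k, scaled=True)   # stays close to 1
    print(d_k, round(raw, 2), round(fixed, 2))
```

Without scaling, the spread of the scores grows with the dimension; with scaling it stays near 1 regardless of d_k, which keeps softmax away from its saturated regime.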

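Multi-head attention

A single attention head tends to capture one kind of relationship. Multi-head attention runs many heads in parallel, each with its own Q, K, V projections, and concatenates the results. The intuition: different heads can learn different types of relationship (syntax, coreference, position, semantic similarity) without competing for a single set of weights.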
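
One common way to implement this is to project once, split the last dimension into heads, attend per head, and concatenate. The sketch below follows that pattern, including the final output projection from Vaswani et al.; the class name `TinyMultiHeadAttention` and the sizes are illustrative, and it is not meant to mirror the internals of any particular library.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMultiHeadAttention(nn.Module):
    """Illustrative self-attention with num_heads heads (no masking, no dropout)."""

    def __init__(self, d_model, num_heads):
        super().__init__()
        assert d_model % num_heads == 0, "d_model must divide evenly into heads"
        self.num_heads = num_heads
        self.d_k = d_model // num_heads
        # One linear layer each for Q, K, V; every head uses its own slice of the output.
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)   # mixes the concatenated heads

    def split_heads(self, x):
        B, T, _ = x.shape
        # [B, T, d_model] -> [B, H, T, d_k]
        return x.view(B, T, self.num_heads, self.d_k).transpose(1, 2)

    def forward(self, x):
        B, T, d_model = x.shape
        Q = self.split_heads(self.w_q(x))
        K = self.split_heads(self.w_k(x))
        V = self.split_heads(self.w_v(x))
        scores = Q @ K.transpose(-2, -1) / math.sqrt(self.d_k)   # [B, H, T, T]
        weights = F.softmax(scores, dim=-1)
        heads = weights @ V                                      # [B, H, T, d_k]
        # Concatenate the heads back into one vector per position.
        concat = heads.transpose(1, 2).reshape(B, T, d_model)    # [B, T, d_model]
        return self.w_o(concat), weights

torch.manual_seed(0)
layer = TinyMultiHeadAttention(d_model=64, num_heads=4)
x = torch.randn(2, 5, 64)                # [B, T, d_model]
out, weights = layer(x)
print(out.shape)                         # torch.Size([2, 5, 64])
print(weights.shape)                     # torch.Size([2, 4, 5, 5]), one T x T map per head
```

Each head works in a smaller d_k = d_model / num_heads subspace, so the total cost stays close to that of a single full-width head.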

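Why attention is parallelizable

Unlike an RNN, attention can compute the outputs for all positions simultaneously, because the output at one position never depends on the output at another: every dot product is independent. This is why transformers train much faster than RNNs on GPUs.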
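
A small sketch of that independence: computing attention one position at a time, in reverse order, gives the same result as computing every position in one batched matrix product. The sizes, the random input, and the toy RNN-style update at the end are assumptions chosen only for contrast.

```python
import math
import torch
import torch.nn.functional as F

torch.manual_seed(0)
T, d = 6, 8
x = torch.randn(T, d)            # one sequence, used as Q, K and V

# All positions at once: a couple of matrix multiplications.
all_at_once = F.softmax(x @ x.T / math.sqrt(d), dim=-1) @ x      # [T, d]

# One position at a time: each row needs only x, never another row's output,
# so the iterations can run in any order (here: reversed) or in parallel.
rows = []
for t in reversed(range(T)):
    w_t = F.softmax(x[t] @ x.T / math.sqrt(d), dim=-1)           # [T]
    rows.append(w_t @ x)                                         # [d]
one_by_one = torch.stack(rows[::-1])                             # back to original order

same = torch.allclose(all_at_once, one_by_one, atol=1e-5)
print(same)

# An RNN-style update, by contrast, must run in order: step t reads h from step t-1.
W = torch.randn(d, d) * 0.1
h = torch.zeros(d)
for t in range(T):
    h = torch.tanh(x[t] + h @ W)
```

The attention loop can be reordered or batched freely, while the recurrent loop cannot, because each step consumes the previous hidden state.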

Principle: Attention is arguably the most important architectural primitive of the past decade. Knowing exactly what the forward pass computes (Q, K, V → softmax → weighted sum of V) is the foundation for understanding every modern LLM.

Code

Scaled dot-product attention from scratch·python

import torch
import torch.nn.functional as F
import math

def scaled_dot_product_attention(Q, K, V, mask=None):
    # Q, K, V: [B, H, T, d_k]
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)   # [B, H, T, T]
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = F.softmax(scores, dim=-1)                 # [B, H, T, T]
    return weights @ V                                  # [B, H, T, d_k]

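# PyTorch (2.0+) also ships a fused, fast version
Q = K = V = torch.randn(2, 8, 16, 64)               # illustrative [B, H, T, d_k]
out = F.scaled_dot_product_attention(Q, K, V, attn_mask=None)
# The fused kernel should agree with the from-scratch version up to float error
assert torch.allclose(out, scaled_dot_product_attention(Q, K, V), atol=1e-5)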
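
The `mask` argument above is easiest to understand with a causal mask, where each position may attend only to itself and earlier positions. The following is a hedged usage sketch: `masked_attention` is a local copy of the from-scratch computation so the block runs on its own, the sizes are illustrative, and the comparison assumes a PyTorch version that provides `F.scaled_dot_product_attention` with `is_causal`.

```python
import math
import torch
import torch.nn.functional as F

def masked_attention(Q, K, V, mask):
    # Same computation as the from-scratch function: mask == 0 marks blocked pairs.
    scores = Q @ K.transpose(-2, -1) / math.sqrt(Q.size(-1))
    scores = scores.masked_fill(mask == 0, float("-inf"))
    return F.softmax(scores, dim=-1) @ V

torch.manual_seed(0)
B, H, T, d_k = 1, 2, 5, 8                # illustrative sizes
Q, K, V = (torch.randn(B, H, T, d_k) for _ in range(3))

# Lower-triangular mask: position t may attend to positions 0..t only.
causal = torch.tril(torch.ones(T, T))    # [T, T], broadcasts over B and H
ours = masked_attention(Q, K, V, causal)
fused = F.scaled_dot_product_attention(Q, K, V, is_causal=True)

matches_fused = torch.allclose(ours, fused, atol=1e-5)
# With a causal mask, the first position can only see itself,
# so its output is exactly its own value vector.
first_is_own_value = torch.allclose(ours[:, :, 0], V[:, :, 0])
print(matches_fused, first_is_own_value)
```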

Multi-head attention via nn.MultiheadAttention·python

import torch
import torch.nn as nn

mha = nn.MultiheadAttention(embed_dim=512, num_heads=8, batch_first=True)
x = torch.randn(2, 16, 512)            # [B, T, d_model]
out, weights = mha(x, x, x)            # self-attention: q=k=v=x
print(out.shape)                       # [2, 16, 512]
print(weights.shape)                   # [2, 16, 16] averaged over heads
