The position problem: attention doesn't care about order

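One important, easy-to-miss property of self-attention: it is permutation-equivariant. Shuffle the input tokens and the outputs shuffle the same way, with the rows and columns of the attention score matrix permuted to match. The pairwise relationships are preserved, but the model has no idea which token came first, second, or last.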

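RNNs got order for free: the hidden state updates one step at a time, so position is implicitly baked into the dynamics. Transformers have nothing like that, so they need an explicit mechanism for injecting position. Without one, "the dog bit the man" and "the man bit the dog" are literally identical to a model built from unmasked self-attention: the same bag of tokens, the same set of pairwise relationships, no order.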
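To make the "dog bit man" point concrete, here is a minimal sketch. It assumes a hypothetical four-word vocabulary with random embeddings (not taken from any trained model) and a single unmasked attention layer with Q = K = V, so it illustrates the symmetry rather than how a real model behaves.

```python
import torch

torch.manual_seed(0)

# Hypothetical toy vocabulary with random embeddings (illustration only).
vocab = {"the": 0, "dog": 1, "bit": 2, "man": 3}
emb = torch.randn(len(vocab), 8)

def attn(X):
    # Single-head, unmasked self-attention with Q = K = V = X, no positional signal.
    s = X @ X.T / X.shape[-1] ** 0.5
    return torch.softmax(s, dim=-1) @ X

def encode(sentence):
    ids = torch.tensor([vocab[w] for w in sentence.split()])
    return attn(emb[ids])

a = encode("the dog bit the man")   # "dog" sits at index 1
b = encode("the man bit the dog")   # "dog" sits at index 4

# The output vector for "dog" is the same in both sentences.
same_dog = torch.allclose(a[1], b[4], atol=1e-5)
# So any order-insensitive summary, such as mean pooling, matches too.
same_pooled = torch.allclose(a.mean(0), b.mean(0), atol=1e-5)
print(same_dog, same_pooled)
```

Both checks compare the two sentences after lining up matching tokens, which is exactly the information a position-free attention layer cannot recover on its own.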

Three families of solutions

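Sinusoidal positional encoding (the original 2017 paper): add a fixed sinusoidal pattern to the input embeddings. Each position gets a distinct vector of sin/cos values at different frequencies, and the model learns how to read it.
Learned positional embeddings (BERT, GPT-2): learn an embedding for each position from 0 to max_len - 1, just like a token embedding, and add it to the input.
RoPE / ALiBi (the modern standard): inject position inside attention rather than at the input layer. RoPE rotates Q and K by a position-dependent angle; ALiBi adds a distance-based bias to the attention scores. These generally extrapolate better to longer sequences, and some variants need no extra parameters.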
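As a rough sketch of where each family plugs in, the snippet below builds a sinusoidal table, a learned position table, and an ALiBi-style score bias. The sizes, the random stand-in embeddings, and the single `slope` value are assumptions made for illustration; the bias here is a symmetric simplification (real ALiBi uses a fixed set of per-head slopes, typically with a causal mask), and RoPE is left out to keep the example short.

```python
import math
import torch

torch.manual_seed(0)

seq_len, d = 6, 8
tok = torch.randn(seq_len, d)      # stand-in token embeddings
pos = torch.arange(seq_len)        # position indices 0..seq_len-1

# 1) Sinusoidal: fixed table, sin on even dims and cos on odd dims.
freq = torch.exp(-math.log(10000.0) * torch.arange(0, d, 2) / d)   # (d/2,)
angles = pos[:, None] * freq[None, :]                               # (seq_len, d/2)
pe = torch.zeros(seq_len, d)
pe[:, 0::2] = torch.sin(angles)
pe[:, 1::2] = torch.cos(angles)
x_sin = tok + pe                   # added to the input, nothing to train

# 2) Learned: one trainable vector per position index 0..max_len-1.
max_len = 16                       # hypothetical maximum sequence length
pos_emb = torch.nn.Embedding(max_len, d)
x_learned = tok + pos_emb(pos)     # added to the input, trained with the model

# 3) ALiBi-style: inputs untouched, a distance penalty is added to the scores.
slope = 0.5                        # hypothetical single slope for one head
dist = (pos[:, None] - pos[None, :]).abs()
scores = tok @ tok.T / math.sqrt(d) - slope * dist
weights = torch.softmax(scores, dim=-1)

print(x_sin.shape, x_learned.shape, weights.shape)
```

The first two change what goes into the layer; the third changes the attention computation itself, which is why it carries over to any sequence length without a position table.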

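The next four lessons cover these in detail. It may look like trivia, but this is where much of the difference between a 4K-context model and a 1M-context model comes from.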

Code

Demonstrating permutation equivariance·python

import torch
def attn(X):
    Q, K, V = X, X, X        # use embeddings directly for the demo
    s = Q @ K.T / X.shape[-1] ** 0.5
    return torch.softmax(s, -1) @ V

X = torch.randn(4, 8)         # 4 tokens, d=8
P = torch.tensor([[0,0,1,0],[1,0,0,0],[0,0,0,1],[0,1,0,0]]).float()
# P is a permutation matrix: P @ X reorders the rows (tokens) of X.
out_orig = attn(X)
out_perm = attn(P @ X)
# out_perm should equal P @ out_orig (permutation-equivariant).
print(torch.allclose(P @ out_orig, out_perm, atol=1e-5))  # True
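A natural follow-up is to check that adding position information breaks this symmetry. The sketch below reuses the same `attn` definition and adds a sinusoidal table before attention; the `sinusoidal` helper is written here for illustration and is an assumption, not a library function.

```python
import math
import torch

torch.manual_seed(0)

def attn(X):
    # Same demo attention as above: Q = K = V = X.
    s = X @ X.T / X.shape[-1] ** 0.5
    return torch.softmax(s, -1) @ X

def sinusoidal(n, d):
    # Illustrative helper: fixed sin/cos table of shape (n, d), d even.
    pos = torch.arange(n, dtype=torch.float32)[:, None]
    freq = torch.exp(-math.log(10000.0) * torch.arange(0, d, 2) / d)
    pe = torch.zeros(n, d)
    pe[:, 0::2] = torch.sin(pos * freq)
    pe[:, 1::2] = torch.cos(pos * freq)
    return pe

X = torch.randn(4, 8)               # 4 tokens, d=8
perm = torch.tensor([2, 0, 3, 1])   # same reordering as P in the demo above
pe = sinusoidal(4, 8)

out_orig = attn(X + pe)
out_perm = attn(X[perm] + pe)       # tokens shuffled, positions stay 0..3

# With positions attached, shuffling the tokens no longer just shuffles the output.
equivariant = torch.allclose(out_orig[perm], out_perm, atol=1e-5)
print(equivariant)
```

Each token now carries a different positional offset depending on where it lands, so the two runs see genuinely different inputs and the model can tell the orderings apart.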
