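Attention patterns: what heads actually learn

If you visualize the attention weights of a trained Transformer per layer and per head, you find that heads specialize with surprising consistency. It is one of the interpretability findings that holds across model families.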

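Head types that commonly emerge

Positional heads. Always attend to a fixed offset (most often −1, and in encoder models also +1). This is how the network still "knows" word order after the position encoding has been processed.
Syntactic heads. Attend along grammatical relations: subject → verb, adjective → the noun it modifies, pronoun → antecedent.
Induction heads (Anthropic, 2022). If the model saw "A B" earlier in the context and then sees "A" again, an induction head attends precisely to the earlier "B". A key mechanism behind in-context learning and copying behavior.
Semantic heads. Attend to topically related tokens regardless of distance, which is useful for long-range coherence.
Rare-token heads. Concentrate attention on unusual or important tokens (named entities, low-frequency vocabulary).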
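
One rough way to check whether a head behaves like a positional head is to measure how much attention it puts on a fixed offset. The sketch below is only an illustration: `offset_score` is a hypothetical helper (not part of any library), and the two matrices are synthetic stand-ins for a real `(L, L)` attention map, with rows as queries and columns as keys.

```python
import numpy as np

def offset_score(attn, offset=-1):
    """Mean attention weight each query position puts on position (query + offset)."""
    attn = np.asarray(attn, dtype=float)
    n = attn.shape[0]
    # Keep only queries whose target (query + offset) lies inside the sequence.
    queries = np.array([q for q in range(n) if 0 <= q + offset < n])
    if queries.size == 0:
        return 0.0
    return float(attn[queries, queries + offset].mean())

# Synthetic "perfect previous-token head" over 5 tokens:
# row i puts all its weight on key i-1 (row 0 can only attend to itself).
previous_token_head = np.zeros((5, 5))
previous_token_head[0, 0] = 1.0
for i in range(1, 5):
    previous_token_head[i, i - 1] = 1.0

# Uniform causal attention for comparison: row i spreads weight evenly over keys 0..i.
uniform_head = np.tril(np.ones((5, 5)))
uniform_head /= uniform_head.sum(axis=1, keepdims=True)

positional_score = offset_score(previous_token_head, offset=-1)
baseline_score = offset_score(uniform_head, offset=-1)
print(positional_score, baseline_score)
```

A score near 1 suggests a head that almost always looks at the chosen offset. The uniform baseline shows what an unspecialized causal head would score on the same short sequence.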

Nobody hand-codes any of this. It emerges from next-token prediction training. Anthropic's "induction head" paper argued that some of these specialized heads are not decorative: they are a proximate cause of capabilities such as few-shot learning.
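
The "A B ... A → B" behavior can be turned into a simple score. A common diagnostic is to feed the model a block of tokens repeated twice and check whether each token in the second copy attends to the token that followed its first occurrence. The sketch below assumes that setup; `induction_score` is a hypothetical helper, and the attention matrices are synthetic rather than taken from a real model.

```python
import numpy as np

def induction_score(attn, period):
    """Mean attention from each repeated token to the token after its first occurrence.

    Assumes the input was `period` tokens repeated, so the token at query position q
    (q >= period) first appeared at q - period, and the induction target is q - period + 1.
    """
    attn = np.asarray(attn, dtype=float)
    n = attn.shape[0]
    queries = np.arange(period, n)
    targets = queries - period + 1
    return float(attn[queries, targets].mean())

period, n = 4, 8  # 4 tokens repeated twice

# Idealized induction head: first-copy queries park on position 0,
# second-copy queries attend exactly to the induction target.
ideal_induction = np.zeros((n, n))
ideal_induction[:period, 0] = 1.0
for q in range(period, n):
    ideal_induction[q, q - period + 1] = 1.0

# A plain previous-token head for contrast: every query attends to q - 1.
plain_previous = np.zeros((n, n))
plain_previous[0, 0] = 1.0
for q in range(1, n):
    plain_previous[q, q - 1] = 1.0

ideal_score = induction_score(ideal_induction, period)
contrast_score = induction_score(plain_previous, period)
print(ideal_score, contrast_score)
```

With a real model, `attn` would be one head's `(L, L)` attention map on a repeated sequence. A high score on that input, paired with a low score on non-repeated text, is the kind of evidence that points toward induction-like behavior.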

Code

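Visualizing one head's attention (Python)

import matplotlib.pyplot as plt
import torch
from transformers import AutoModel, AutoTokenizer

# Ask the model to return attention weights alongside its hidden states.
# Note: on some recent transformers versions you may also need to pass
# attn_implementation="eager" for attention weights to be returned.
model = AutoModel.from_pretrained("gpt2", output_attentions=True)
tok = AutoTokenizer.from_pretrained("gpt2")

inputs = tok("The cat sat on the mat", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs)

# out.attentions: tuple with one (B, n_heads, L, L) tensor per layer
layer_idx, head_idx = 5, 3
attn = out.attentions[layer_idx][0, head_idx]   # (L, L): rows are queries, columns are keys

plt.imshow(attn.numpy(), cmap="viridis")
# Map token ids back to token strings (tok.tokenize expects raw text, not ids).
# GPT-2 marks a leading space with "Ġ".
labels = tok.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
plt.xticks(range(len(labels)), labels, rotation=45)
plt.yticks(range(len(labels)), labels)
plt.title(f"Layer {layer_idx} Head {head_idx}")
plt.savefig("head.png")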
