ALiBi: injecting position with linear biases

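ALiBi (Attention with Linear Biases, Press et al. 2022) takes the most radical approach to position: it drops positional embeddings entirely. Instead, it adds a linear bias to each attention score, and that bias is proportional to the distance between the query and key positions: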

attention_score(i, j) = q_i · k_j^T − m × |i − j|

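Each attention head gets its own slope m, taken from a fixed geometric sequence (nothing is learned). Nearby positions score higher; distant positions take a penalty that grows linearly with distance. The model never sees a positional embedding: position exists only in the bias added to the attention logits.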
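To make the slopes concrete, here is a minimal sketch in plain Python. It assumes the head count is a power of two, in which case the geometric sequence starts at 2^(-8/n) and uses that same value as its ratio. The helper names `alibi_slopes` and `bias_row` are made up for this illustration.

```python
def alibi_slopes(n_heads):
    """Geometric slope sequence; assumes n_heads is a power of two."""
    # First slope is 2^(-8/n); each later slope multiplies by the same value.
    start = 2 ** (-8 / n_heads)
    return [start ** (h + 1) for h in range(n_heads)]

def bias_row(query_pos, seq_len, slope):
    """Bias added to the logits of one query position: -m * |i - j|."""
    return [-slope * abs(query_pos - j) for j in range(seq_len)]

slopes = alibi_slopes(8)
print(slopes)                      # 1/2, 1/4, ..., 1/256
print(bias_row(3, 5, slopes[0]))   # steepest head: strong pull toward nearby keys
print(bias_row(3, 5, slopes[-1]))  # flattest head: almost no distance penalty
```

With several slopes in play, some heads focus on a tight local window while others can still look far back.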

What this buys you

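No positional parameters or operations at the input. Nothing is added to the token embeddings, so there is no position table to store or learn. Press et al. report about 11% less memory and about 11% faster training; that comparison is against a sinusoidal model trained at length 2048, which the ALiBi model trained at length 1024 matches in perplexity.
Strong extrapolation. A model trained on 1024-token sequences runs on 2048 tokens without a loss in quality. All it takes is extending the bias matrix.
Very simple. One subtraction in the attention computation.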
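"Extending the bias matrix" can be sketched as follows, for a single head with a hypothetical slope of 0.5 and tiny lengths standing in for 1024 and 2048. It only shows that the bias is well defined at any length and agrees with the shorter one where they overlap; whether model quality holds at the longer length is an empirical question this snippet does not answer.

```python
def alibi_bias_matrix(seq_len, slope):
    """Bias for one head: entry (i, j) is -slope * |i - j|."""
    return [[-slope * abs(i - j) for j in range(seq_len)] for i in range(seq_len)]

slope = 0.5                  # hypothetical slope for a single head
train_len, test_len = 4, 8   # small stand-ins for 1024 and 2048

short = alibi_bias_matrix(train_len, slope)
long = alibi_bias_matrix(test_len, slope)

# The longer matrix agrees with the shorter one wherever both are defined.
top_left = [row[:train_len] for row in long[:train_len]]
print(top_left == short)

# The new entries simply continue the same straight line.
print(long[0])
```

Nothing here is tied to the training length: there is no table to run out of, which is why no retraining or resizing step is needed to build the bias for longer inputs.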

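The BLOOM and MPT models use ALiBi. As of 2026 it is less well known than RoPE, because RoPE with YaRN scaling caught up on extrapolation while keeping the elegance of treating relative position as a rotation. It is still worth knowing: the core insight (position can live in a bias rather than in an embedding) influenced later work.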

Code

ALiBi bias matrix (Python)

import math

import torch

def alibi_bias(seq_len, n_heads):
    # Slopes are a fixed geometric sequence: 2^(-8/n), 2^(-16/n), ...
    # This closed form assumes n_heads is a power of two.
    def get_slopes(n):
        start = 2 ** (-2 ** -(math.log2(n) - 3))
        return torch.tensor([start ** (i + 1) for i in range(n)])

    slopes = get_slopes(n_heads)              # (n_heads,)
    # Distance matrix of |i - j|
    pos = torch.arange(seq_len)
    rel = (pos[None, :] - pos[:, None]).abs().float()   # (seq_len, seq_len)

    # Per-head bias: -m * |i - j|, broadcast over heads
    bias = -slopes[:, None, None] * rel[None]            # (n_heads, seq_len, seq_len)
    return bias

# Inside attention:
# scores = Q @ K.transpose(-2, -1) / d_k**0.5
# scores = scores + alibi_bias(seq_len, n_heads)        # add per-head bias
# weights = softmax(scores, dim=-1)
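To see what the added bias does to the attention weights, here is a small worked example in plain Python for a single query position. As a simplifying assumption, every key gets the same content score, so any difference in the weights comes from the ALiBi penalty alone. The `alibi_weights` helper is made up for this illustration.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def alibi_weights(content_scores, query_pos, slope):
    """Softmax of content scores plus the ALiBi penalty -slope * |i - j|."""
    biased = [s - slope * abs(query_pos - j) for j, s in enumerate(content_scores)]
    return softmax(biased)

# Assumption for illustration: all keys match the query equally well.
content = [0.0] * 6
steep = alibi_weights(content, query_pos=5, slope=0.5)
flat = alibi_weights(content, query_pos=5, slope=2 ** -8)

print([round(w, 3) for w in steep])  # weight rises toward the query position
print([round(w, 3) for w in flat])   # close to uniform
```

In a real model the content scores are not all equal, so a strongly matching distant key can still win; the bias only tilts the comparison toward nearby positions.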

Exercise

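Briefly train a small Transformer encoder on a synthetic copy task at sequence length 256. Then evaluate at sequence length 1024 with three positional encodings: (a) learned PE with positions up to 256, (b) sinusoidal extrapolated to 1024, (c) ALiBi. Which ones break, and which hold up? Plot accuracy vs. test sequence length.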
