Linear Attention and Performer

The Kernelization Trick

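The attention computation softmax(Q·Kᵀ)·V is O(n²) because of the n×n matrix in the middle. Performer (Choromanski et al., 2021) and the broader linear attention family ask: what if we replace the softmax with a kernel feature map φ, so that softmax(Q·Kᵀ) ≈ φ(Q)·φ(K)ᵀ (up to the row-wise normalization, which is computed separately)?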

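Given such a φ, φ(Q)·φ(K)ᵀ·V can be rewritten as φ(Q)·(φ(K)ᵀ·V). The parenthesization matters: the right-hand form first computes the d×d matrix φ(K)ᵀ·V (cost O(n·d²)), then multiplies it by φ(Q) (cost O(n·d²)). Total: O(n·d²), assuming the feature dimension of φ is on the order of d. That is linear in sequence length and quadratic in model dimension. For long sequences with a reasonable model dimension (n much larger than d), it is a huge win.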
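
To make the cost argument concrete, here is a small sketch that counts multiply-adds for the two parenthesizations. The function name `attention_matmul_costs` is hypothetical, the feature dimension of φ is assumed equal to d, and the softmax and normalization costs are ignored.

```python
def attention_matmul_costs(n, d):
    """Multiply-add counts for the two ways of parenthesizing phi(Q) phi(K)^T V.

    Assumes phi maps d-dimensional rows to d-dimensional features.
    """
    # (phi(Q) @ phi(K).T) @ V: an n x n matrix, then an (n x n) @ (n x d) product
    quadratic = n * n * d + n * n * d
    # phi(Q) @ (phi(K).T @ V): a d x d matrix, then an (n x d) @ (d x d) product
    linear = n * d * d + n * d * d
    return quadratic, linear


quadratic, linear = attention_matmul_costs(n=4096, d=64)
print(quadratic, linear, quadratic // linear)  # the ratio is n / d = 64
```

Under these assumptions the ratio is always n / d, so the reordering only pays off once the sequence is longer than the model dimension.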

Why It Did Not Dominate

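The math works. The practical results were disappointing. The random feature map approximates softmax, but the approximation gets worse at long context: the feature dimension of φ needed to approximate softmax faithfully grows with the entropy of the attention distribution, and high-entropy distributions over many tokens (which is what long context produces) blow up the cost of the approximation.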
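
A minimal sketch of the kind of estimator involved, assuming the positive random feature construction φ(x)_i = exp(w_i·x − |x|²/2)/√m with w_i drawn from a standard normal. The function names are hypothetical. With this construction the expected value of φ(q)·φ(k) is exp(q·k), and the relative variance contributed by each feature is exp(|q + k|²) − 1, so larger query and key magnitudes need more features for the same accuracy. The sketch illustrates that dependence only; it does not by itself test the entropy argument.

```python
import math
import random


def positive_random_features(x, ws):
    """phi(x)_i = exp(w_i . x - |x|^2 / 2) / sqrt(m) for each random vector w_i."""
    sq_norm = sum(v * v for v in x)
    scale = 1.0 / math.sqrt(len(ws))
    return [
        scale * math.exp(sum(wi * xi for wi, xi in zip(w, x)) - sq_norm / 2.0)
        for w in ws
    ]


def estimate_exp_kernel(q, k, m, seed=0):
    """Monte Carlo estimate of exp(q . k) from m shared random features."""
    rng = random.Random(seed)
    ws = [[rng.gauss(0.0, 1.0) for _ in q] for _ in range(m)]
    phi_q = positive_random_features(q, ws)
    phi_k = positive_random_features(k, ws)
    return sum(a * b for a, b in zip(phi_q, phi_k))


def relative_variance(q, k):
    """Per-feature relative variance of the estimator: exp(|q + k|^2) - 1."""
    return math.exp(sum((a + b) ** 2 for a, b in zip(q, k))) - 1.0


q = [0.3, -0.2, 0.1, 0.4]
k = [0.2, 0.1, -0.3, 0.2]
exact = math.exp(sum(a * b for a, b in zip(q, k)))
for m in (16, 256, 4096):
    print(m, abs(estimate_exp_kernel(q, k, m) - exact) / exact)
print("relative variance per feature:", relative_variance(q, k))
```

Scaling q and k up by a constant makes `relative_variance` grow exponentially, which is one way to see why a fixed feature budget can stop being enough.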

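What actual benchmarks showed: at the lengths where the speedup matters most, Performer-style models lost meaningful quality relative to standard attention. The community's conclusion: exact attention quality is hard to give up. The approximation has to be very good, or it costs more than it saves.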

Lasting Contribution

Performer did not win, but it established a conceptual framework that later worked better in other forms: structured attention through clever factorization. Kimi Linear, MHLA, and gated linear attention all descend from this idea, with smarter choices of φ and richer state mechanisms that overcome the quality issues of the original Performer.

Code

Linear attention factorization: the core trick · python

# Standard attention: O(n^2 * d)
# attn = softmax(Q @ K.T) @ V              # n x n matrix in the middle

# Linear attention with kernel phi: O(n * d^2)
# Numerator: phi(Q) @ (phi(K).T @ V)        # d x d matrix in the middle
# Denominator: phi(Q) @ (phi(K).T @ ones)   # ones is a length-n vector
# Output: numerator / denominator           # row-wise division

# The trick is associativity:
# (phi(Q) @ phi(K).T) @ V   ==   phi(Q) @ (phi(K).T @ V)
# Left side: O(n^2 * d). Right side: O(n * d^2).
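
The pseudocode above can be turned into a small runnable check. This is a sketch in pure Python with nested lists, not an efficient implementation. The feature map `phi` here is elu(x) + 1, chosen only because it is positive and simple; it is an assumption for illustration and not Performer's random feature map. All function names are hypothetical.

```python
import math
import random


def matmul(a, b):
    """Plain nested-list matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def phi(x):
    """elu(v) + 1 elementwise: v + 1 for v > 0, exp(v) otherwise (always positive)."""
    return [[v + 1.0 if v > 0 else math.exp(v) for v in row] for row in x]


def linear_attention_quadratic(Q, K, V):
    """Left parenthesization: materializes the n x n score matrix."""
    scores = matmul(phi(Q), transpose(phi(K)))
    out = matmul(scores, V)
    return [[v / sum(s_row) for v in o_row] for s_row, o_row in zip(scores, out)]


def linear_attention_linear(Q, K, V):
    """Right parenthesization: only a (features x d) summary of K and V is built."""
    pq, pk = phi(Q), phi(K)
    kv = matmul(transpose(pk), V)           # phi(K).T @ V
    k_sum = [sum(col) for col in zip(*pk)]  # phi(K).T @ ones
    out = matmul(pq, kv)
    denom = [sum(a * b for a, b in zip(row, k_sum)) for row in pq]
    return [[v / z for v in o_row] for o_row, z in zip(out, denom)]


rng = random.Random(0)
n, d = 8, 4
Q, K, V = ([[rng.uniform(-1, 1) for _ in range(d)] for _ in range(n)] for _ in range(3))
a = linear_attention_quadratic(Q, K, V)
b = linear_attention_linear(Q, K, V)
print(max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb)))
```

The two functions should agree up to floating-point rounding, since they differ only in the order of the matrix products.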

Exercise

Implement Performer-style linear attention with a random feature map in PyTorch. On a small (1, 8, 1024, 64) input, compare its output against standard scaled-dot-product attention. Measure both the speed and the divergence between the outputs. Observe that for short sequences the divergence is small but so is the speed advantage; for long sequences the speed advantage grows, but so does the divergence. That tradeoff is why Performer did not win.