When Efficient Attention Is Enough

The sweet spot for efficient attention

If your sequences are mostly under 64K tokens and you need strong recall, an efficient Transformer variant is the practical choice. Sliding-window attention (Mistral-style), aggressive GQA variants, and NSA-style learned sparsity all compose with everything else in your stack, and none of them requires retraining on a new architecture.

A "good enough" efficient Transformer beats a "theoretically better" alternative that lacks ecosystem support. A 10% quality loss from an immature alternative architecture often costs more in production (bugs, support time, missing features) than a 3× saving in inference cost buys back.

The real crossover point

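Below 20–30K tokens, an optimized Transformer (FA3 + GQA + sliding window) is usually faster than SSMs and hybrids in real wall-clock measurements. The asymptotic-complexity argument does not kick in until you are past the zone where constant factors dominate. If your P95 input is under 20K tokens, you are entirely below the crossover and should stop reading post-Transformer papers as if they were directly relevant to your stack.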
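
The P95 check is easy to run against logged input lengths. The following is a minimal sketch, assuming you already have per-request token counts; the sample list, the function names, and the use of the nearest-rank percentile definition are illustrative assumptions, and the 20K floor is simply the lower edge of the band quoted above.

```python
import math

# Lower edge of the 20-30K crossover band quoted in the text.
CROSSOVER_FLOOR_TOKENS = 20_000


def percentile(values, pct):
    """Nearest-rank percentile: the smallest value with at least pct% of the data at or below it."""
    if not values:
        raise ValueError("values must be non-empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def below_crossover(token_counts, floor=CROSSOVER_FLOOR_TOKENS):
    """True when the P95 input length sits under the crossover floor."""
    return percentile(token_counts, 95) < floor


# Hypothetical sample; replace with token counts from your own request logs.
sample_token_counts = [512, 1_200, 3_400, 8_000, 950, 15_000, 2_200, 4_100, 700, 18_500]

p95 = percentile(sample_token_counts, 95)
print(f"P95 = {p95} tokens, below crossover: {below_crossover(sample_token_counts)}")
```

A real check would use far more than ten requests; with a small sample the P95 is just the largest value.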

The "linear is always better" myth

Myth: "Linear time always wins." Reality: below 20–30K tokens, an optimized Transformer wins on wall-clock latency. The crossover depends on hardware (faster GPUs push it later), batch size (larger batches push it later), and implementation maturity (more mature kernels push it later). An H100 running FlashAttention-3 in 2026 can cross over later than an A100 running naive attention in 2024.
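
To see why the constants matter, consider a toy cost model, not a measurement: attention time grows as a·n² and a linear-time alternative as b·n with a larger per-token constant b, so the two curves cross at n = b/a. The coefficients below are hypothetical values chosen only so that the optimized case lands inside the 20–30K band; they are not derived from any real kernel or GPU.

```python
def attention_time(n, quad_coeff):
    """Toy model: time grows quadratically with sequence length n."""
    return quad_coeff * n * n


def linear_time(n, lin_coeff):
    """Toy model: time grows linearly with n, with a larger per-token constant."""
    return lin_coeff * n


def crossover_tokens(quad_coeff, lin_coeff):
    """Sequence length where a*n^2 == b*n, i.e. n = b / a."""
    return lin_coeff / quad_coeff


# Hypothetical coefficients (seconds per unit of work), for illustration only.
NAIVE_ATTENTION = 4e-9      # immature attention kernel
OPTIMIZED_ATTENTION = 1e-9  # mature attention kernel: smaller quadratic constant
LINEAR_ARCH = 2.5e-5        # linear-time alternative

for label, a in [("naive", NAIVE_ATTENTION), ("optimized", OPTIMIZED_ATTENTION)]:
    n_star = crossover_tokens(a, LINEAR_ARCH)
    print(f"{label:>9} attention: crossover near {n_star:,.0f} tokens")
```

In this model, shrinking the quadratic constant by 4× moves the crossover 4× later, which is the sense in which a more mature kernel pushes the crossover out. Hardware and batch-size effects enter a real system through the same constants, which is why they have to be measured rather than assumed.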

Bottom line: always benchmark on your actual workload. "Linear attention is faster" is a nearly meaningless claim without a sequence length and a hardware specifier. "Linear attention is faster on an H100 at 64K tokens with batch size 4" is a specific, useful claim.

Exercise

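Take a representative set of inputs from your workload (smallest, median, largest) and time them through a Transformer with default attention, with FlashAttention, and with a sliding window. The point is not to switch architectures; it is to know the numbers your stack actually produces. With those numbers in hand, every architecture-change discussion becomes more concrete.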
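
One way to structure the exercise is a small timing harness that runs every variant over every input and reports median wall-clock time. The sketch below uses only the standard library; the two variant functions are hypothetical CPU stand-ins that merely count causal query-key pairs, so swap in calls to your own model configured with default attention, FlashAttention, and a sliding window.

```python
import statistics
import time


def time_call(fn, arg, warmup=2, repeats=5):
    """Median wall-clock seconds for fn(arg), after a few untimed warmup calls."""
    for _ in range(warmup):
        fn(arg)
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(arg)
        # On a GPU, synchronize the device here before reading the clock,
        # otherwise only the kernel launch is timed.
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def benchmark(variants, inputs, warmup=2, repeats=5):
    """Return {variant name: {input label: median seconds}}."""
    return {
        name: {label: time_call(fn, x, warmup, repeats) for label, x in inputs.items()}
        for name, fn in variants.items()
    }


# Hypothetical stand-ins: they count (query, key) pairs instead of running a model.
def full_causal_pairs(tokens):
    n = len(tokens)
    return sum(1 for q in range(n) for _ in range(q + 1))


def windowed_causal_pairs(tokens, window=64):
    n = len(tokens)
    return sum(1 for q in range(n) for _ in range(max(0, q - window + 1), q + 1))


# Smallest, median, and largest inputs; the sizes here are placeholders.
inputs = {
    "smallest": list(range(16)),
    "median": list(range(128)),
    "largest": list(range(400)),
}
variants = {"full": full_causal_pairs, "sliding_window": windowed_causal_pairs}

results = benchmark(variants, inputs)
for name, row in results.items():
    for label, seconds in row.items():
        print(f"{name:>15} {label:>8} {seconds * 1e3:8.3f} ms")
```

Reporting the median over several repeats, after warmup, keeps one-off stalls from distorting the comparison. Record the hardware and batch size alongside the numbers so that each result is a specific claim of the kind described above.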