Not every alternative abandons attention

Three families inside the attention frame

If you stay inside the attention frame, three approaches dominate. Kernelized attention approximates softmax with a kernel function, which lets the computation factorize so that it scales linearly with sequence length. Sparse attention attends to only a subset of tokens: sliding windows, global tokens, or learned sparsity patterns. Structured attention rearranges the order of computation or the shape of attention for efficiency (multi-head latent attention, grouped-query attention, multi-query attention).
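To make the "factorize to linear" idea concrete, here is a minimal pure-Python sketch of the reordering that kernelized attention relies on. The feature map `phi` (elu(x) + 1) and the function names are assumptions chosen for illustration, not something this lesson prescribes. The sketch is non-causal and replaces the softmax kernel rather than reproducing it exactly. It only shows that summarizing keys and values once gives the same result as building every pairwise score.

```python
import math

def phi(x):
    # elu(x) + 1: a positive feature map, assumed here for illustration
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]

def quadratic_form(Q, K, V):
    """O(n^2): compute every pairwise score phi(q_i) . phi(k_j), then normalize."""
    out = []
    for q in Q:
        fq = phi(q)
        scores = [sum(a * b for a, b in zip(fq, phi(k))) for k in K]
        z = sum(scores)
        out.append([sum(s * v[d] for s, v in zip(scores, V)) / z
                    for d in range(len(V[0]))])
    return out

def linear_form(Q, K, V):
    """O(n): summarize keys and values once, then reuse the summary per query."""
    r, dv = len(K[0]), len(V[0])
    S = [[0.0] * dv for _ in range(r)]  # running sum of phi(k_j) v_j^T
    z = [0.0] * r                       # running sum of phi(k_j)
    for k, v in zip(K, V):
        fk = phi(k)
        for a in range(r):
            z[a] += fk[a]
            for d in range(dv):
                S[a][d] += fk[a] * v[d]
    out = []
    for q in Q:
        fq = phi(q)
        denom = sum(fq[a] * z[a] for a in range(r))
        out.append([sum(fq[a] * S[a][d] for a in range(r)) / denom
                    for d in range(dv)])
    return out
```

Both functions return the same values up to floating-point error. The difference is that `linear_form` never materializes an n × n score matrix, so its cost grows with sequence length rather than with its square.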

The common thread: all three preserve the attention abstraction at the API level. A model trained with FlashAttention-3 can swap a single layer to sliding-window attention without changing the rest of the stack. This drop-in compatibility is the killer feature: you keep the whole Transformer ecosystem (vLLM, SGLang, TRT-LLM, HF Transformers) and only lower the cost.
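The drop-in property can be sketched in a few lines: if every variant is just a different mask passed to the same attention call, the call site never changes. This is a toy pure-Python illustration, not how any production kernel is written. The names `attention`, `full_mask`, `sliding_window_mask`, and `attended_pairs` are hypothetical.

```python
import math

def full_mask(i, j):
    # Causal full attention: token i sees itself and every earlier token.
    return j <= i

def sliding_window_mask(window):
    # Causal sliding window: token i sees at most the `window` most recent tokens.
    def mask(i, j):
        return i - window < j <= i
    return mask

def attention(Q, K, V, mask):
    """Softmax attention restricted to positions where mask(i, j) is True."""
    scale = 1.0 / math.sqrt(len(Q[0]))
    out = []
    for i, q in enumerate(Q):
        allowed = [j for j in range(len(K)) if mask(i, j)]
        scores = [scale * sum(a * b for a, b in zip(q, K[j])) for j in allowed]
        m = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        out.append([sum(w * V[j][d] for w, j in zip(weights, allowed)) / total
                    for d in range(len(V[0]))])
    return out

def attended_pairs(n, mask):
    # Number of (query, key) pairs the mask allows: a rough proxy for cost.
    return sum(1 for i in range(n) for j in range(n) if mask(i, j))

print(attended_pairs(8, full_mask))               # 36
print(attended_pairs(8, sliding_window_mask(3)))  # 21
```

Swapping `full_mask` for `sliding_window_mask(w)` changes the cost without touching the signature of `attention`. That is the same reason the surrounding serving stack can stay as it is.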

Why this matters strategically

For most teams in 2026, the realistic upgrade path from a vanilla Transformer is not "replace it with Mamba" but "replace some of the attention layers with sparse attention." The architectural risk is low, since you are still inside the attention paradigm, and the upside (1.5–4× efficiency at long context) is meaningful. Pure-replacement architectures (Mamba, RWKV) require a more substantial commitment.

This is also the lane where most Transformer improvements from 2023 to 2026 actually shipped. GQA, MQA, sliding window, NSA, MoBA, Kimi Linear: all of them crossed the line from research to production faster than any pure-replacement architecture. The ecosystem effect compounds: improvements in this lane stack on top of FlashAttention-3, GQA, speculative decoding, and quantization, while improvements in the pure-replacement lane usually have to rebuild each integration from scratch.

Exercise

Pick a Hugging Face Transformers model whose architecture supports sliding-window attention (Mistral 7B is the canonical choice). Run two inference benchmarks at 32K context: one with the default full attention, one with the sliding window enabled. Measure tokens/s and peak memory. The sliding-window run should be noticeably cheaper without breaking the model. That is the value proposition of this lane.
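Before running the benchmark, it can help to estimate how much KV-cache memory the two runs should need, so you have something to compare the measurement against. The sketch below is back-of-the-envelope arithmetic, not a measurement. The default shape values (32 layers, 8 KV heads, head dimension 128, window 4096, fp16) are assumptions I believe match the Mistral-7B-v0.1 config; verify them against the `config.json` of the checkpoint you actually use. `kv_cache_bytes` is a hypothetical helper name.

```python
def kv_cache_bytes(tokens_cached, n_layers=32, n_kv_heads=8,
                   head_dim=128, bytes_per_elem=2):
    # Factor of 2 covers keys and values; all defaults are assumed values.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * tokens_cached

context = 32 * 1024   # 32K-token prompt, as in the exercise
window = 4096         # assumed sliding-window size

full = kv_cache_bytes(context)
windowed = kv_cache_bytes(min(context, window))

print(full / 2**30)      # 4.0 (GiB per sequence)
print(windowed / 2**30)  # 0.5 (GiB per sequence)
print(full / windowed)   # 8.0
```

The measured peak memory will be higher than either number because it also includes the weights and activations. The 8× gap only shows up if the runtime actually discards cache entries that fall outside the window, which depends on the library version and cache implementation.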
