Kimi Linear, MHLA, Gated Attention

Kimi Linear: the 2025 headline result

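Kimi Linear (Moonshot AI, October 2025) is the first linear-attention variant to claim an across-the-board advantage over full Multi-head Latent Attention (MLA) at production scale. The headline numbers: up to a 75% reduction in KV cache, up to 6× faster decoding at 1M context, and quality that is competitive with or better than the full-attention baseline on standard benchmarks. The architecture is built on KDA (Kimi Delta Attention), a variant in which the linear-attention state is updated in a key-dependent way (think of it as selectivity inside linear attention). The model is a hybrid: KDA layers are interleaved with full MLA layers at a 3:1 ratio, so only about a quarter of the layers still carry a growing KV cache, which is where the 75% figure comes from.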
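To make "a state that is updated in a key-dependent way" concrete, here is a minimal sketch of one recurrent step of a gated delta-rule update, the family of updates KDA belongs to. It is an illustration under assumptions, not Moonshot's implementation: the function name `gated_delta_step`, the per-channel decay `alpha`, and the scalar write strength `beta` are names chosen for this example, and the production kernels are chunked and run on GPU rather than token by token in pure Python.

```python
def gated_delta_step(S, q, k, v, alpha, beta):
    """One token of a gated delta-rule linear-attention layer (illustrative).

    S     : d_k x d_v state matrix (list of rows), fixed size regardless of context length
    q, k  : query and key vectors of length d_k
    v     : value vector of length d_v
    alpha : per-key-channel decay in [0, 1], length d_k (assumed gating form)
    beta  : scalar write strength in [0, 1]
    """
    d_k, d_v = len(k), len(v)
    # 1) Forget: decay each key channel of the state independently.
    S = [[alpha[i] * S[i][j] for j in range(d_v)] for i in range(d_k)]
    # 2) Read back what the state currently stores under key k.
    pred = [sum(k[i] * S[i][j] for i in range(d_k)) for j in range(d_v)]
    # 3) Delta-rule write: correct the stored value toward v along direction k.
    #    Equivalent to S <- (I - beta * k k^T) S + beta * k v^T.
    S = [[S[i][j] + beta * k[i] * (v[j] - pred[j]) for j in range(d_v)]
         for i in range(d_k)]
    # 4) Output: query the updated state.
    out = [sum(q[i] * S[i][j] for i in range(d_k)) for j in range(d_v)]
    return S, out


# Writing a second value under the same unit-norm key overwrites the first,
# instead of averaging the two as plain additive linear attention would.
state = [[0.0, 0.0], [0.0, 0.0]]
key, no_decay = [1.0, 0.0], [1.0, 1.0]
state, out1 = gated_delta_step(state, key, key, [3.0, 4.0], no_decay, 1.0)
state, out2 = gated_delta_step(state, key, key, [5.0, 6.0], no_decay, 1.0)
```

The point of the sketch is the contrast with older linear attention: the write in step 3 depends on what the state already holds for that key, so stale associations can be replaced rather than blended in, while the state itself never grows with context length.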

If the Kimi Linear results hold up under independent third-party evaluation, this is the moment linear attention finally caught up with full attention on quality while keeping its efficiency advantage. Performer's old failure mode (quality degradation at long context) appears to have been solved not by cleverer random features but by cleverer state design.

MHLA: the global-context-collapse fix

One recurring failure mode of linear-attention models is global context collapse: as more tokens are processed, the running state averages over too much, and information about specific tokens fades into mush. MHLA (Multi-Head Linear Attention, January 2026) addresses this by partitioning tokens into blocks and maintaining local KV summaries that preserve the broader structure of the context.

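MHLA illustrates a concrete pattern that is likely to recur: if the problem with linear attention is that all history blends together, structure the history into blocks so that the blending is explicit and locality is preserved. Variants of this pattern show up in Mamba's MIMO formulation and in RetNet's chunkwise mode, which are different solutions to the same structural concern.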
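The block pattern can be sketched in a few lines. This is a toy illustration of block-wise KV summaries for linear attention, not the MHLA paper's algorithm: the `elu(x) + 1` feature map, the function names, and the explicit `block_weights` argument are all assumptions made for the example.

```python
import math


def feature_map(x):
    # elu(x) + 1: a common positive feature map for linear attention (assumed here).
    return [xi + 1.0 if xi > 0 else math.exp(xi) for xi in x]


def block_summaries(keys, values, block_size):
    """Build one (S_b, z_b) summary per block of tokens.

    S_b = sum over the block of phi(k) v^T   (d_k x d_v)
    z_b = sum over the block of phi(k)       (d_k), used for normalization
    """
    d_k, d_v = len(keys[0]), len(values[0])
    summaries = []
    for start in range(0, len(keys), block_size):
        S = [[0.0] * d_v for _ in range(d_k)]
        z = [0.0] * d_k
        block = zip(keys[start:start + block_size], values[start:start + block_size])
        for k, v in block:
            fk = feature_map(k)
            for i in range(d_k):
                z[i] += fk[i]
                for j in range(d_v):
                    S[i][j] += fk[i] * v[j]
        summaries.append((S, z))
    return summaries


def read_blocks(query, summaries, block_weights):
    """Normalized linear-attention read over explicitly weighted block summaries."""
    fq = feature_map(query)
    d_v = len(summaries[0][0][0])
    num, den = [0.0] * d_v, 0.0
    for (S, z), w in zip(summaries, block_weights):
        for i, fqi in enumerate(fq):
            den += w * fqi * z[i]
            for j in range(d_v):
                num[j] += w * fqi * S[i][j]
    return [n / den for n in num]


# Four tokens, two blocks. Uniform weights reproduce the single global state
# (everything blended); non-uniform weights keep block identity visible.
keys = [[0.0], [0.0], [0.0], [0.0]]
values = [[1.0], [1.0], [3.0], [3.0]]
blocks = block_summaries(keys, values, block_size=2)
blended = read_blocks([0.0], blocks, [1.0, 1.0])
local = read_blocks([0.0], blocks, [0.0, 1.0])
```

With uniform weights the read is exactly what a single running state would give, which is the collapse case. Keeping the summaries separate costs one small state per block, and in exchange the mixing step becomes something a model can learn or condition on position instead of being baked into one sum.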

Gated Attention: small wins that compound

Gated Attention (NeurIPS 2025 Best Paper) showed that even a minimal modification to standard attention, adding a simple gating mechanism to the attention output, yields a measurable accuracy gain at minimal cost. The lesson: there is still meaningful headroom in refining attention itself, not only in searching for alternatives to it.
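A minimal single-head sketch of the idea follows: compute ordinary softmax attention, then multiply the output elementwise by a sigmoid gate computed from the layer input. The function names and the gate projection `W_gate` are hypothetical, and the exact placement and parameterization of the gate in the paper may differ; treat this as an illustration of how small the change is, not as the paper's implementation.

```python
import math


def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    es = [math.exp(x - m) for x in xs]
    total = sum(es)
    return [e / total for e in es]


def attention(q, keys, values):
    """Standard scaled dot-product attention for a single query."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    weights = softmax(scores)
    return [sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))]


def gated_attention(x, q, keys, values, W_gate):
    """Attention output multiplied elementwise by a sigmoid gate.

    x      : layer input for the current token (the gate is computed from it)
    W_gate : d_v x d_model gate projection (hypothetical extra parameter)
    """
    out = attention(q, keys, values)
    logits = [sum(W_gate[j][i] * x[i] for i in range(len(x)))
              for j in range(len(out))]
    gate = [1.0 / (1.0 + math.exp(-g)) for g in logits]
    return [g * o for g, o in zip(gate, out)]


# With an all-zero gate projection every gate value is 0.5,
# so the gated output is exactly half the plain attention output.
x = [1.0, 0.0]
plain = attention(x, [[1.0, 0.0]], [[2.0, 4.0]])
gated = gated_attention(x, x, [[1.0, 0.0]], [[2.0, 4.0]],
                        [[0.0, 0.0], [0.0, 0.0]])
```

The added cost is one small projection and an elementwise multiply per token, which is why the change counts as minimal: the attention computation itself is untouched, and the gate only decides how much of each output channel to let through.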

This is the unsung half of the post-Transformer story. The other half is the dramatic alternatives. Both are real, and a healthy team should keep track of both lanes.

Exercise

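Read the Kimi Linear paper (or the team's blog post, if that is more accessible) and the Gated Attention NeurIPS paper. Identify the single most surprising result in each, and think about how it changes the production-architecture conversation. The point is not to memorize numbers; it is to develop the habit of distilling each new architecture paper into "what one thing has changed about what we should do?"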
