Mamba and State-Space Models: Linear Scaling · Beyond the Transformer Quest

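Premise

Transformers have a quadratic scaling problem: attention costs O(n²) in sequence length. Doubling the context length quadruples attention compute, while the KV cache doubles, because it grows linearly with context. State-space models (SSMs) replace attention with a linear-scaling alternative: O(n) compute with a fixed-size state. On very long sequences this is an asymptotic win that no constant-factor attention optimization can match.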
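The scaling argument can be checked with back-of-the-envelope arithmetic. The sketch below is a rough cost model, not a benchmark: constants, heads, and layers are dropped, and the function names and the sizes (`d_model=1024`, `d_state=16`) are illustrative assumptions.

```python
def attention_costs(n_tokens: int, d_model: int) -> dict:
    """Rough per-layer costs of self-attention over n_tokens (constants dropped)."""
    return {
        # Score matrix: every token is compared against every token.
        "score_flops": n_tokens * n_tokens * d_model,
        # KV cache: one key and one value vector stored per token.
        "kv_cache_entries": 2 * n_tokens * d_model,
    }


def ssm_costs(n_tokens: int, d_model: int, d_state: int) -> dict:
    """Rough per-layer costs of a recurrent SSM scan (constants dropped)."""
    return {
        # One fixed-cost state update per token.
        "scan_flops": n_tokens * d_model * d_state,
        # The carried state does not depend on n_tokens at all.
        "state_entries": d_model * d_state,
    }


a1, a2 = attention_costs(4096, 1024), attention_costs(8192, 1024)
s1, s2 = ssm_costs(4096, 1024, 16), ssm_costs(8192, 1024, 16)

print(a2["score_flops"] / a1["score_flops"])            # 4.0  (quadratic)
print(a2["kv_cache_entries"] / a1["kv_cache_entries"])  # 2.0  (linear)
print(s2["scan_flops"] / s1["scan_flops"])              # 2.0  (linear)
print(s2["state_entries"] / s1["state_entries"])        # 1.0  (constant)
```

Doubling the context quadruples only the attention score computation; the KV cache and the SSM scan double, and the SSM state stays the same size.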

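What Mamba added to SSMs

Earlier SSMs (S4, S5) had fixed dynamics: the state-update equation did not depend on the input. Mamba (December 2023) introduced selective state spaces: input-dependent dynamics, where the model decides from the current token how much past state to keep, forget, or update. This single change closed most of the quality gap with attention while preserving the linear-scaling advantage.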
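To make "input-dependent dynamics" concrete, here is a minimal single-channel sketch, assuming a diagonal negative `A`, a softplus step size, and a simplified discretization. The weight names (`w_delta`, `W_B`, `W_C`) and all sizes are hypothetical, and this is not the reference Mamba implementation. A fixed-dynamics SSM would use the same `delta`, `B`, and `C` for every token; here all three are computed from the token itself.

```python
import numpy as np


def softplus(z):
    """log(1 + exp(z)), computed stably; keeps the step size positive."""
    return np.logaddexp(0.0, z)


def selective_step(u_t, x_t, h_prev, A_diag, w_delta, W_B, W_C):
    """One selective update for a single channel.

    u_t: scalar input for this channel; x_t: token features (d_model,);
    h_prev: previous state (d_state,); A_diag: negative diagonal of A (d_state,).
    """
    delta = softplus(w_delta @ x_t)  # input-dependent step size (scalar)
    A_bar = np.exp(delta * A_diag)   # per-dimension retention factor in (0, 1)
    B_t = W_B @ x_t                  # input-dependent write direction
    C_t = W_C @ x_t                  # input-dependent read-out
    h_new = A_bar * h_prev + delta * B_t * u_t
    y_t = C_t @ h_new
    return y_t, h_new, A_bar


rng = np.random.default_rng(0)
d_model, d_state = 2, 4
A_diag = -np.ones(d_state)
w_delta = np.array([1.0, 0.0])
W_B = rng.normal(size=(d_state, d_model))
W_C = rng.normal(size=(d_state, d_model))
h_prev = np.ones(d_state)

# A token that drives delta toward 0 keeps almost all of the old state...
_, h_keep, retain_keep = selective_step(
    1.0, np.array([-5.0, 1.0]), h_prev, A_diag, w_delta, W_B, W_C)
# ...while a token that drives delta up wipes most of it.
_, h_forget, retain_forget = selective_step(
    1.0, np.array([5.0, 1.0]), h_prev, A_diag, w_delta, W_B, W_C)

print(retain_keep[0])    # about 0.993
print(retain_forget[0])  # about 0.007
```

The retention factor is what lets the current token choose between keeping and overwriting the state.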

Mental model

Attention says: "look at every previous token, weighted by similarity." An SSM says: "carry a fixed-size state, update it as each new token arrives, and discard the original token." Attention has perfect recall but expensive scaling. An SSM has bounded memory but linear scaling. They are different bets about what matters.
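The two bets can be placed side by side in a few lines. This is a toy illustration under loud simplifications: the raw token vectors serve as queries, keys, and values (no learned projections), and the SSM side is a fixed exponential decay (`decay = 0.9`, an arbitrary value) rather than a learned or selective update.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
tokens = rng.normal(size=(100, d))

cache = []           # attention side: every past token is kept
state = np.zeros(d)  # SSM side: one fixed-size vector
decay = 0.9          # hypothetical fixed decay, for illustration only

for x_t in tokens:
    # Attention: look at all cached tokens, weighted by similarity to x_t.
    cache.append(x_t)
    K = np.stack(cache)                      # (t, d), grows every step
    scores = K @ x_t / np.sqrt(d)
    weights = np.exp(scores - scores.max())  # stable softmax
    weights /= weights.sum()
    attn_out = weights @ K

    # SSM: fold the token into the state, then discard the token.
    state = decay * state + (1 - decay) * x_t

cache_floats = len(cache) * d
state_floats = state.size
print(cache_floats, state_floats)  # 800 8
```

After 100 tokens the attention side holds 800 floats and keeps growing, while the state is still 8 floats; the price is that the state can no longer reproduce any individual token exactly.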

Mamba 2: the duality result

Mamba 2 (2024) showed that selective SSMs and attention are unified under the "structured state-space duality" (SSD) framework: different parameterizations of a similar underlying computation. This lets Mamba 2 borrow optimization techniques from attention engineering, yielding faster training. Mamba 3 (2026) refines this further.
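The flavor of the duality can be verified numerically on a tiny case. The sketch below assumes a single channel and a scalar per-step decay `a[t]`; the sizes and random values are illustrative. It computes the same outputs two ways: as a linear-time recurrence, and as a multiplication by a lower-triangular, attention-like matrix whose entries are `C[t] · B[s]` scaled by the accumulated decay between steps `s` and `t`.

```python
import numpy as np

rng = np.random.default_rng(0)
T, N = 6, 4                         # sequence length, state size
a = rng.uniform(0.5, 1.0, size=T)   # input-dependent scalar decay per step
B = rng.normal(size=(T, N))         # per-step input projection
C = rng.normal(size=(T, N))         # per-step read-out
x = rng.normal(size=T)              # one input channel

# Recurrent form: one fixed-size state update per token.
h = np.zeros(N)
y_rec = np.empty(T)
for t in range(T):
    h = a[t] * h + B[t] * x[t]
    y_rec[t] = C[t] @ h

# Matrix form: a causal T x T mixing matrix, as in masked attention.
L = np.zeros((T, T))
for t in range(T):
    for s in range(t + 1):
        L[t, s] = np.prod(a[s + 1 : t + 1])  # empty product is 1 when s == t
M = L * (C @ B.T)                            # M[t, s] = L[t, s] * (C[t] . B[s])
y_mat = M @ x

print(np.allclose(y_rec, y_mat))  # True
```

`C @ B.T` plays the role of the query-key score matrix and `L` plays the role of the causal mask, with decay weights in place of zeros and ones.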

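Where SSMs win

Above roughly 370 tokens, Mamba inference becomes faster than an equivalent Transformer. By 4K tokens, Mamba has a roughly 12x memory-efficiency advantage. For very-long-context streaming workloads (transcription, log analysis, real-time monitoring), SSMs are competitive or superior.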

Where SSMs still struggle

Exact recall. An SSM compresses past context into a fixed-size state, so long-range exact recall (the "find a fact buried 90% of the way into a 100K-token doc" kind) is harder.
Frontier-scale training. Pure SSM models above ~10B parameters are rare, and confidence in their scaling laws is much weaker than for Transformers.
Reasoning workloads. Extended thinking benefits from precise recall over long sequences, which is attention's strength.
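The exact-recall limit can be illustrated with a stand-in that is deliberately not an SSM: a linear associative memory that stores key-value pairs as a sum of outer products in a fixed `d_state x d_state` matrix. The function name, sizes, and random data are assumptions for illustration; the point is only that a fixed-size store blurs as more items are written into it.

```python
import numpy as np


def recall_error(n_items: int, d_state: int = 64, seed: int = 0) -> float:
    """Mean relative error when reading n_items back from a fixed-size memory."""
    rng = np.random.default_rng(seed)
    keys = rng.normal(size=(n_items, d_state))
    keys /= np.linalg.norm(keys, axis=1, keepdims=True)  # unit-length keys
    values = rng.normal(size=(n_items, d_state))

    # Write: the memory is always d_state x d_state, however many items go in.
    memory = keys.T @ values  # sum over items of outer(key, value)

    # Read: query with each key; other items leak in as crosstalk.
    recalled = keys @ memory
    err = np.linalg.norm(recalled - values, axis=1)
    return float(np.mean(err / np.linalg.norm(values, axis=1)))


e_small = recall_error(4)     # far fewer items than state dimensions
e_large = recall_error(1024)  # far more items than state dimensions
print(e_small < e_large)      # True
```

With a handful of items the read-back is close to the stored values; with many more items than state dimensions, crosstalk dominates. An attention cache avoids this by keeping every item, at the memory cost described above.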

Code

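Selective SSM update (intuition, not a real implementation)·python

def selective_ssm_step(x_t, h_t, A, B, C, delta_t):
    # x_t: current input token's hidden representation, shape (d_model,)
    # h_t: state carried from the previous step, shape (d_state,)
    # A: (d_state, d_state), B: (d_state, d_model), C: (d_out, d_state)
    # delta_t: input-dependent step size in [0, 1], the 'selective' part;
    #          near 0 keeps the old state, near 1 overwrites it
    h_t_new = (1 - delta_t) * (A @ h_t) + delta_t * (B @ x_t)
    y_t = C @ h_t_new  # read the output from the updated state
    return y_t, h_t_new

# One fixed-cost update per token: linear in sequence length,
# with no quadratic attention computation.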
