What is Mamba: selectivity in one lesson

The one change: parameters become functions of the input

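Mamba (Gu & Dao, December 2023, arXiv:2312.00752) is a selective state space model, abbreviated S6. The headline change fits in one sentence: B, C, and Δ are no longer fixed but are functions of the current input x_t. That's it. That is the entire conceptual difference from S4, S4D, and H3.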

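The mechanism: B and C are produced by linear projections of x_t. Δ is the discretization step size, and conceptually a per-token gate for "how much should this token affect the state". It is also produced from x_t (a linear projection followed by softplus to keep it positive). When Δ is large, the model integrates the current token strongly; when Δ is small, the existing state dominates. Δ is the selectivity gate: it lets the model say "this token matters, remember it" or "this token is filler, skip it".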
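To make the role of Δ concrete, here is a minimal scalar sketch. It assumes a one-dimensional state with a = -1 and b = 1 (values chosen only for illustration) and uses the same simplified discretization as the code later in this lesson, A_bar = exp(Δ·a) and B_bar = Δ·b. The helper name `contributions` is hypothetical.

```python
import math

def contributions(h_prev, x, delta, a=-1.0, b=1.0):
    """Split one scalar SSM step into its two terms.

    kept    = A_bar * h_prev, how much of the old state survives
    written = B_bar * x,      how much of the current token is written in
    """
    a_bar = math.exp(delta * a)   # close to 1 for small delta, close to 0 for large delta
    b_bar = delta * b             # close to 0 for small delta, grows with delta
    return a_bar * h_prev, b_bar * x

h_prev, x = 10.0, 1.0
kept_small, written_small = contributions(h_prev, x, delta=0.01)
kept_large, written_large = contributions(h_prev, x, delta=5.0)

print(f"small delta: kept={kept_small:.3f}, written={written_small:.3f}")
print(f"large delta: kept={kept_large:.3f}, written={written_large:.3f}")
```

With a small Δ almost all of the old state survives and the token barely registers; with a large Δ the old state decays away and the token is written in strongly.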

Why this is the right kind of expressiveness gain

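The LTI (linear time-invariant) restriction wasn't just a mathematical limitation; it had a concrete cost. Pre-Mamba SSMs can't do content-aware compression. Every word goes through the same time-invariant filter, whether it's a key fact or a stop word. With selectivity, the model can route high-information tokens into the state and let low-information tokens pass through with minimal effect.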

This is similar to what the gates in LSTMs and GRUs did. In a sense, selectivity is the SSM family rediscovering the content-dependent gating that S4/S4D/H3 had simplified away. The difference is that selectivity sits inside the SSM framework, so it gets the gating LSTMs had decades ago while keeping the rich connection to control theory and the hardware-friendly structure.
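The link to RNN gates can be checked numerically in the simplest case. The sketch below assumes a scalar state with A = -1 and B = 1, and it uses the zero-order-hold form of B_bar, (exp(Δ·A) - 1)/A · B, rather than the simplified Δ·B used in this lesson's code. Here `z` is a hypothetical stand-in for the linear projection of x_t that feeds softplus. Under those assumptions the selective step should collapse to a sigmoid-gated update.

```python
import math

def softplus(z):
    return math.log1p(math.exp(z))

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def selective_step(h_prev, x, z):
    """Scalar selective SSM step with A = -1, B = 1, zero-order-hold discretization."""
    delta = softplus(z)
    a_bar = math.exp(-delta)         # exp(delta * A) with A = -1
    b_bar = 1.0 - math.exp(-delta)   # (exp(delta * A) - 1) / A * B
    return a_bar * h_prev + b_bar * x

def gated_step(h_prev, x, z):
    """GRU-style convex gate: g decides how much of x replaces the old state."""
    g = sigmoid(z)
    return (1.0 - g) * h_prev + g * x

for z in (-3.0, 0.0, 2.5):
    print(z, selective_step(0.7, -1.2, z), gated_step(0.7, -1.2, z))
```

The two columns agree because exp(-softplus(z)) = 1 - sigmoid(z), so in this special case Δ plays exactly the role of a gate.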

The cost: the convolution view dies

No free lunch. Once B, C, and Δ depend on x_t, you can no longer unroll the recurrence into a single global convolution kernel: there is no longer one kernel, but a different one at every timestep. So S4's FFT-based parallel training trick stops working. Mamba had to invent a different way to parallelize training, and that is the next lesson: the hardware-aware selective scan.
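A small scalar sketch of why the kernel disappears. It assumes a one-dimensional state and, for simplicity, takes the per-step Δ values as a given list instead of computing them from x_t; all function names are hypothetical. With a constant Δ the recurrence matches a single convolution kernel K[k] = c · a_bar^k · b_bar; once Δ varies per step, no single kernel reproduces the output.

```python
import math

def lti_recurrence(xs, a_bar, b_bar, c):
    h, ys = 0.0, []
    for x in xs:
        h = a_bar * h + b_bar * x
        ys.append(c * h)
    return ys

def lti_convolution(xs, a_bar, b_bar, c):
    # One global kernel, reused at every position: K[k] = c * a_bar**k * b_bar.
    kernel = [c * a_bar**k * b_bar for k in range(len(xs))]
    return [sum(kernel[k] * xs[t - k] for k in range(t + 1)) for t in range(len(xs))]

def selective_recurrence(xs, deltas, a=-1.0, b=1.0, c=1.0):
    # Same recurrence, but A_bar and B_bar change at every step through delta.
    h, ys = 0.0, []
    for x, delta in zip(xs, deltas):
        h = math.exp(delta * a) * h + delta * b * x
        ys.append(c * h)
    return ys

xs = [1.0, 0.0, 2.0, -1.0]
conv = lti_convolution(xs, math.exp(-0.3), 0.3, 1.0)
fixed = selective_recurrence(xs, [0.3, 0.3, 0.3, 0.3])     # constant delta: still LTI
varying = selective_recurrence(xs, [0.3, 2.0, 0.1, 1.0])   # per-token delta

print("convolution  :", conv)
print("fixed delta  :", fixed)
print("varying delta:", varying)
```

The weight of an earlier token on a later output now depends on every Δ in between, not just on the lag, which is why a lag-indexed kernel can no longer describe the model.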

Code

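Mamba S6 step: per-token update, PyTorch sketch

import torch
import torch.nn.functional as F

batch, d_model, d_state = 2, 8, 16             # toy sizes, for illustration only
linear_B = torch.nn.Linear(d_model, d_state)
linear_C = torch.nn.Linear(d_model, d_state)
linear_d = torch.nn.Linear(d_model, d_model)   # simplified: one step size per channel

# x_t: (batch, d_model), input at timestep t
# A: (d_model, d_state), fixed w.r.t. the input; diagonal per channel as in Mamba,
#    negative for stability (HiPPO-derived, S4D-real-style init)
# h_prev: (batch, d_model, d_state), previous state h_{t-1}
x_t = torch.randn(batch, d_model)
A = -torch.arange(1, d_state + 1, dtype=torch.float32).repeat(d_model, 1)
h_prev = torch.zeros(batch, d_model, d_state)

B_t = linear_B(x_t)                            # (batch, d_state)
C_t = linear_C(x_t)                            # (batch, d_state)
delta_t = F.softplus(linear_d(x_t))            # (batch, d_model), selectivity gate

A_bar = torch.exp(delta_t.unsqueeze(-1) * A)        # discretized state transition
B_bar = delta_t.unsqueeze(-1) * B_t.unsqueeze(1)    # discretized input projection

h_t = A_bar * h_prev + B_bar * x_t.unsqueeze(-1)    # state update (now input-dependent)
y_t = (h_t * C_t.unsqueeze(1)).sum(dim=-1)          # output, (batch, d_model)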

Exercise

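Install mamba-ssm (requires CUDA) and run the Mamba-130M checkpoint from Hugging Face on a short generation task. Measure prefill time at 1K, 8K, 32K, and 128K tokens and compare against Pythia-160M (a Transformer with a similar parameter count) at the same lengths. Pythia was trained with a 2,048-token context, so treat the longer runs purely as timing measurements, and expect memory to become the limit at the largest lengths. You should see Mamba's prefill time grow roughly linearly while Pythia's grows roughly quadratically: selectivity didn't change that asymptotic story.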
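One way to turn "roughly linear" versus "roughly quadratic" into a number is to fit the slope of log(time) against log(length). The helper below is a hypothetical sketch, not part of mamba-ssm or any benchmark tool, and the timings fed to it are synthetic placeholders rather than measurements; substitute your own numbers from the exercise.

```python
import math

def scaling_exponent(lengths, seconds):
    """Least-squares slope of log(seconds) against log(lengths).

    A slope near 1 suggests linear growth, a slope near 2 suggests quadratic growth.
    """
    xs = [math.log(n) for n in lengths]
    ys = [math.log(s) for s in seconds]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den

lengths = [1024, 8192, 32768, 131072]

# Synthetic timings for illustration only, NOT real measurements.
linear_like = [1e-6 * n for n in lengths]
quadratic_like = [1e-9 * n * n for n in lengths]

print("linear-like slope   :", scaling_exponent(lengths, linear_like))
print("quadratic-like slope:", scaling_exponent(lengths, quadratic_like))
```

Real measurements will not land exactly on 1 or 2: fixed launch overhead flattens the curve at short lengths, and the attention term only dominates a Transformer's cost once sequences are long enough.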