Why SSMs are exciting again

The hardware finally caught up

SSMs as a deep-learning idea aren't new: they were tried in the 2010s and didn't take off. What changed is the hardware. Modern GPUs have enormous matrix-multiply throughput and a large, fast memory hierarchy (HBM + SRAM). The dual training/inference trick (FFT convolution during training, recurrence during inference) maps beautifully onto that hardware once the kernels are written. S4 was the moment SSMs became practical at scale.

The convolution view works because the linear recurrence h_t = Ā·h_{t-1} + B̄·x_t, with output y_t = C·h_t, can be unrolled into a single global convolution kernel K = (CB̄, CĀB̄, CĀ²B̄, ..., CĀ^(n-1)B̄). Applying K to the input with an FFT gives O(n log n) training that is fully parallel across the sequence. The recurrence view is used at inference: cheap, sequential, and constant-memory per step. Same parameters, two execution patterns.
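To make the "same parameters, two execution patterns" claim concrete, here is a minimal NumPy sketch. It is not S4's actual implementation: the function names (`ssm_recurrent`, `ssm_kernel`, `ssm_fft_conv`), the single-channel scalar input, the zero initial state, and the small diagonal Ā are all assumptions made for illustration. The kernel is also built naively by repeated multiplication, whereas S4 relies on a structured parameterization to compute it efficiently.

```python
import numpy as np

def ssm_recurrent(A_bar, B_bar, C, x):
    """Inference-style pass: one step at a time, fixed-size state."""
    h = np.zeros(A_bar.shape[0])          # assumed zero initial state
    ys = np.empty(len(x))
    for t, x_t in enumerate(x):
        h = A_bar @ h + B_bar * x_t       # h_t = A_bar h_{t-1} + B_bar x_t
        ys[t] = C @ h                     # y_t = C h_t
    return ys

def ssm_kernel(A_bar, B_bar, C, n):
    """Unroll the recurrence into K[k] = C A_bar^k B_bar for k = 0..n-1."""
    K = np.empty(n)
    v = B_bar.copy()
    for k in range(n):
        K[k] = C @ v
        v = A_bar @ v
    return K

def ssm_fft_conv(K, x):
    """Training-style pass: causal convolution of K with x via FFT."""
    n = len(x)
    L = 2 * n                             # zero-pad so the circular conv does not wrap
    y = np.fft.irfft(np.fft.rfft(K, L) * np.fft.rfft(x, L), L)
    return y[:n]

# Illustrative (made-up) parameters: a stable diagonal A_bar with 4 state dims.
rng = np.random.default_rng(0)
A_bar = np.diag(rng.uniform(0.1, 0.9, size=4))
B_bar = rng.standard_normal(4)
C = rng.standard_normal(4)
x = rng.standard_normal(64)

y_rec = ssm_recurrent(A_bar, B_bar, C, x)
y_fft = ssm_fft_conv(ssm_kernel(A_bar, B_bar, C, len(x)), x)
assert np.allclose(y_rec, y_fft)          # both views give the same outputs
```

The recurrent pass only ever holds the state vector h, which is why inference memory stays constant in sequence length; the FFT pass touches the whole sequence at once, which is what makes training parallel.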

Production deployments are the proof

The 2024–2026 wave of production deployments should shift your mental model from "SSMs are research" to "SSMs ship." IBM Granite 4.0 uses a 9:1 Mamba-to-attention ratio in a production-grade open-weights model and achieves a 70%+ reduction in RAM compared with a comparable pure Transformer. NVIDIA Nemotron-H ships 8B and 56B models in which ~92% of the attention layers have been replaced with Mamba-2. AI21 Jamba runs on AWS Bedrock, Azure, and GCP with an effective 256K context.

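None of these is a pure SSM. They're all hybrids. But the SSM layers do the heavy lifting, and the attention layers are minority residents whose job is to restore the recall ability that a pure SSM can't provide. The productization story validates the SSM bet: the architecture works at scale, in production, with real customers paying real money.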

Exercise

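Pull up the IBM Granite 4.0 model card on Hugging Face and skim the architecture section. Note the exact ratio of Mamba layers to attention layers and the layer counts. Then look up Jamba's ratio and Nemotron-H's ratio. You'll find that they cluster in roughly the 8:1 to 9:1 range. That convergence is no accident; it is the empirical answer the field has reached for the recall-vs-efficiency tradeoff at production scale.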
