Production deployment, and the honest limits

What is actually in production

Mamba-family architectures are now running in real production at several major labs:

AI21 Jamba 1.5: 256K context, available on AWS Bedrock, Azure, and GCP. It was the first Mamba-family model to land on the major cloud marketplaces. The 1.5 release marks the moment hybrid Mamba moved from research into enterprise procurement.

NVIDIA Nemotron-H 8B / 47B / 56B: 92% of the attention layers are replaced by Mamba-2. About 3× faster than Llama-3.1 70B at 65K context. The 56B was pretrained in FP8 on 20T tokens across 6,144 H100s, which is production scale by any measure.

IBM Granite 4.0: a 9:1 Mamba-to-attention ratio, with a RAM reduction of more than 70% against a comparable pure Transformer. IBM's enterprise customers get a smaller serving footprint at comparable quality.
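To see why a 9:1 ratio shrinks the serving footprint, it helps to count only the KV cache, which grows with context length and exists only in attention layers. The sketch below is back-of-the-envelope arithmetic under loud assumptions: the layer count, head count, head dimension, and context length are hypothetical and are not Granite's real configuration. It also ignores model weights and the fixed-size Mamba state, so the cache-only reduction it prints is larger than the whole-model RAM figure quoted above.

```python
def kv_cache_bytes(n_attention_layers, n_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_value=2):
    """Bytes needed to cache keys and values for every attention layer."""
    # Factor of 2: one tensor for keys, one for values.
    return (2 * n_attention_layers * n_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_value)


# Hypothetical 40-layer model; NOT any vendor's published configuration.
TOTAL_LAYERS = 40
N_KV_HEADS = 8
HEAD_DIM = 128
SEQ_LEN = 128_000  # tokens of context, 16-bit cache values

# Pure Transformer: every layer is an attention layer.
pure = kv_cache_bytes(TOTAL_LAYERS, N_KV_HEADS, HEAD_DIM, SEQ_LEN)

# 9:1 hybrid: one attention layer for every ten layers in total.
hybrid_attention_layers = TOTAL_LAYERS // 10
hybrid = kv_cache_bytes(hybrid_attention_layers, N_KV_HEADS, HEAD_DIM, SEQ_LEN)

reduction = 1 - hybrid / pure
print(f"pure Transformer KV cache: {pure / 2**30:.2f} GiB")
print(f"9:1 hybrid KV cache:       {hybrid / 2**30:.2f} GiB")
print(f"KV cache reduction:        {reduction:.0%}")
```

The Mamba layers carry a recurrent state whose size does not depend on sequence length, which is why the gap widens as the context grows.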

TII Falcon Mamba 7B: a pure SSM that nonetheless beats Llama-3.1 8B on standard benchmarks. It is interesting precisely because it is pure: it demonstrates that selectivity alone covers most of the distance at 7B scale.

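Cartesia Llamba-8B: 12× the throughput of its Llama 3.1 8B teacher, achieved by distilling the Transformer into a Mamba student. The distillation path is becoming increasingly important: instead of training a Mamba model from scratch, you take a strong existing Transformer and convert it.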
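As a rough illustration of what "distilling a Transformer into a Mamba student" optimizes, here is the generic logit-matching loss: a temperature-softened KL divergence between teacher and student next-token distributions. This is a minimal sketch of the textbook technique, not Cartesia's actual recipe, which this section does not describe. The function names and the toy logits are made up for the example, and a real pipeline would compute this per token over batches in a tensor library.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions for one token position.

    Assumes moderate logits so no probability underflows to zero.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q))
    # The T^2 factor keeps gradient magnitudes comparable across temperatures.
    return kl * temperature ** 2


# Toy vocabulary of four tokens; values are illustrative only.
teacher = [4.0, 1.0, 0.5, -2.0]
good_student = [3.8, 1.1, 0.4, -1.9]
bad_student = [-2.0, 0.5, 1.0, 4.0]

print(distillation_loss(teacher, good_student))
print(distillation_loss(teacher, bad_student))
```

The student whose logits track the teacher's gets a loss near zero; the one that ranks tokens in the opposite order gets a much larger one. The teacher is frozen, so only the Mamba student's parameters receive gradients.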

The honest limits do not go away

Production validation does not repeal physics. Mamba-family models still have:

Five-shot MMLU gap: pure SSMs noticeably underperform on few-shot in-context learning. Hybrids close most of the gap; pure SSMs do not.
Asymmetry bias from the nonlinear convolution: because of the way the selective scan accumulates information, Mamba is slightly biased toward tokens early in the sequence.
Associative recall failure: this is structural, as established in the previous track.
Narrow learning rate window: production teams have to spend more on hyperparameter sweeps than they would with a Transformer recipe.
Tooling gap: improving quickly, but in 2026 still measurably behind the Transformer ecosystem.

None of these is a deal-breaker for the right workload. All of them are reasons to default to a Transformer (or a Transformer-heavy hybrid) when the workload does not specifically reward Mamba's strengths.

Exercise

Pick one Mamba-family production model (Jamba 1.5 Mini is a friendly choice: open weights, well documented). Pull it from Hugging Face and run it locally, generate against your own long-context test (a 50K-token document of your choosing), and compare quality with Llama 3.1 8B on the same input. Note specifically where Mamba wins (memory, throughput) and where it loses (questions that require exact retrieval of named entities deep in the document).
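One way to make the "where it loses" half of the exercise measurable is a small needle-in-a-haystack harness: plant sentences containing named entities at known depths in the document, ask each model about them, and score exact retrieval. The sketch below is model-agnostic and deliberately leaves out model loading. The helper names, the filler text, and the placeholder answers are all hypothetical; you would replace `answers` with the real generations from each model.

```python
def build_haystack(filler_sentences, needles, depth_fractions):
    """Insert each needle sentence at a relative depth (0.0 to 1.0) in the filler."""
    if len(needles) != len(depth_fractions):
        raise ValueError("need exactly one depth fraction per needle")
    sentences = list(filler_sentences)
    # Insert deepest-first so shallower insert positions are not shifted.
    for fraction, needle in sorted(zip(depth_fractions, needles), reverse=True):
        index = int(fraction * len(filler_sentences))
        sentences.insert(index, needle)
    return " ".join(sentences)


def exact_match_rate(answers, expected_entities):
    """Fraction of answers that contain the expected entity, case-insensitively."""
    hits = sum(
        1 for answer, entity in zip(answers, expected_entities)
        if entity.lower() in answer.lower()
    )
    return hits / len(expected_entities)


# Illustrative data only; use your own 50K-token document as the filler.
filler = [f"Filler sentence number {i}." for i in range(1000)]
needles = [
    "The auditor for the Q3 report was Marisol Okafor.",
    "The backup site is located in Tromso.",
]
document = build_haystack(filler, needles, depth_fractions=[0.5, 0.9])

# Placeholder generations; replace with each model's real output.
answers = ["The auditor was Marisol Okafor.", "I could not find the backup site."]
score = exact_match_rate(answers, ["Marisol Okafor", "Tromso"])
print(f"exact-match retrieval: {score:.0%}")
```

Running the same document and questions through both models, and varying the depth fractions, shows whether retrieval accuracy drops as the entity sits deeper in the context.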
