Nemotron-H and Enterprise Scale

The largest hybrid SSM-attention models

NVIDIA's Nemotron-H (April 2025) is the canonical enterprise-scale hybrid: it comes in 8B and 56B/47B versions, with roughly 92% of the attention layers replaced by Mamba-2. At 65K context it runs about 3× faster than Llama-3.1 70B. The 56B model was pretrained in FP8 on 6,144 H100 GPUs using 20T tokens, which is production pretraining at frontier scale.
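One way to see why replacing most attention layers helps at long context is to count KV-cache bytes, since only attention layers cache keys and values per token while a Mamba-2 layer keeps a fixed-size state. The sketch below is back-of-the-envelope arithmetic only. The layer count, KV-head count, and head dimension are hypothetical values picked for illustration, not Nemotron-H's actual configuration, and the SSM state is ignored.

```python
def kv_cache_bytes(n_attention_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Bytes needed to cache keys and values for one sequence."""
    # Factor of 2: one tensor for keys and one for values in every attention layer.
    return 2 * n_attention_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem


# Hypothetical shapes, for illustration only (NOT Nemotron-H's real config).
TOTAL_MIXER_LAYERS = 50
N_KV_HEADS = 8
HEAD_DIM = 128
SEQ_LEN = 65_536  # the 65K context used in the comparison above

# Keep about 8% of the layers as attention, following the ~92% replacement figure.
hybrid_attention_layers = max(1, round(TOTAL_MIXER_LAYERS * 0.08))

dense_cache = kv_cache_bytes(TOTAL_MIXER_LAYERS, N_KV_HEADS, HEAD_DIM, SEQ_LEN)
hybrid_cache = kv_cache_bytes(hybrid_attention_layers, N_KV_HEADS, HEAD_DIM, SEQ_LEN)

print(f"attention layers kept: {hybrid_attention_layers} of {TOTAL_MIXER_LAYERS}")
print(f"all-attention KV cache: {dense_cache / 2**30:.2f} GiB")
print(f"hybrid KV cache:        {hybrid_cache / 2**30:.2f} GiB")
```

Under these assumed shapes the cache shrinks in direct proportion to the number of attention layers kept. Less cache to read per decoded token is one plausible contributor to the long-context speedup, though this arithmetic says nothing about the measured 3× figure itself.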

This is the data point that settles the question of whether hybrid SSM-attention is production-ready. NVIDIA, the most hardware-aware company in the field, bet 6,144 H100s on a hybrid architecture and shipped the result as a commercial offering. That is not a research bet; it is a serving bet.

MiniPuzzle compression: 56B → 47B

NVIDIA also showed that hybrid architectures are amenable to aggressive post-training compression. The 56B model was pruned and distilled down to 47B parameters, yielding a 20% speed gain on top of the architectural advantage. Part of the reason the compression worked is that a hybrid stack carries natural redundancy across its SSM layers that a pure Transformer stack does not have to the same degree.
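To put the 56B → 47B step in proportion, the arithmetic below computes the relative parameter reduction and the corresponding weight-only memory at two common precisions. It is a rough sketch: the parameter counts are the rounded figures from the text, and activations, KV cache, and SSM state are left out.

```python
PARAMS_BEFORE = 56_000_000_000  # rounded figure from the text
PARAMS_AFTER = 47_000_000_000   # rounded figure from the text


def weight_gib(n_params, bytes_per_param):
    """Memory for the weights alone, in GiB."""
    return n_params * bytes_per_param / 2**30


# Fraction of parameters removed by compression.
reduction = 1 - PARAMS_AFTER / PARAMS_BEFORE
print(f"parameter reduction: {reduction:.1%}")

# Weight-only footprint at 2 bytes (BF16) and 1 byte (FP8) per parameter.
for label, bytes_per_param in [("BF16", 2), ("FP8", 1)]:
    before = weight_gib(PARAMS_BEFORE, bytes_per_param)
    after = weight_gib(PARAMS_AFTER, bytes_per_param)
    print(f"{label}: {before:.1f} GiB -> {after:.1f} GiB")
```

The reduction works out to about 16% of the parameters, so the reported 20% speed gain is of the same order as the size reduction.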

Cosmos-Reason 1: vision-language application

Nemotron-H serves as the backbone of Cosmos-Reason 1, NVIDIA's vision-language model for physical AI. The hybrid backbone is what makes Cosmos-Reason's long-context multimodal reasoning practical: high-resolution video frames at scale need exactly the long-context efficiency a hybrid provides.

Nemotron 3: the next generation

Nemotron 3 (December 2025 / March 2026) is NVIDIA's next-generation hybrid: an MoE architecture, 1M context, and NVFP4 precision (a new low-precision format optimized for Blackwell GPUs). The combination of MoE sparsity, hybrid SSM-attention, and aggressive low-precision training is NVIDIA's bet on what the 2026–2027 production stack looks like.

The pattern is clear: with each generation NVIDIA commits more to hybrid SSM-attention, not less. Coming from the hardware vendor itself, that commitment is a signal worth taking seriously when planning longer-term architecture decisions.

Exercise

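Pull NVIDIA Nemotron-H 8B from Hugging Face. Generate output and time it against Llama 3.1 8B at 65K context. The reported 3× advantage at 65K should reproduce on commodity H100 hardware, although that figure was reported against Llama-3.1 70B, so the exact ratio at the 8B scale may differ. If you don't have an H100, run on whatever you have and observe the relative scaling: even on smaller GPUs, the hybrid should pull ahead at long context.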
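A minimal timing harness for the exercise might look like the sketch below. It assumes both checkpoints load through the standard `transformers` `AutoModelForCausalLM` interface (possibly requiring `trust_remote_code=True`) and fit on a single CUDA GPU in BF16. The helper names `time_generation` and `make_hf_generate_fn` are hypothetical, and the model ids are left as placeholders to look up on the Hub.

```python
import time
from statistics import median


def time_generation(generate_fn, n_warmup=1, n_runs=3):
    """Return the median wall-clock seconds of generate_fn() over n_runs calls."""
    for _ in range(n_warmup):
        generate_fn()  # warm-up: kernel selection, memory allocation
    timings = []
    for _ in range(n_runs):
        start = time.perf_counter()
        generate_fn()
        timings.append(time.perf_counter() - start)
    return median(timings)


def make_hf_generate_fn(model_id, prompt_tokens=65_536, new_tokens=128):
    """Build a zero-argument callable that runs one fixed-length generation.

    Assumption: model_id names a causal LM that loads via AutoModelForCausalLM.
    """
    import torch  # imported lazily so the timing helper works without a GPU stack
    from transformers import AutoModelForCausalLM

    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype=torch.bfloat16, trust_remote_code=True
    ).to("cuda")
    # Random token ids: the content does not matter for a latency measurement.
    input_ids = torch.randint(
        0, model.config.vocab_size, (1, prompt_tokens), device="cuda"
    )
    attention_mask = torch.ones_like(input_ids)

    def run():
        with torch.no_grad():
            model.generate(
                input_ids,
                attention_mask=attention_mask,
                max_new_tokens=new_tokens,
                min_new_tokens=new_tokens,  # force the same output length for both models
                do_sample=False,
            )
        torch.cuda.synchronize()  # wait for queued GPU work before the clock stops

    return run


# Usage (needs a GPU and both checkpoints; fill in the ids yourself):
# hybrid_s = time_generation(make_hf_generate_fn("<Nemotron-H 8B checkpoint id>"))
# dense_s = time_generation(make_hf_generate_fn("<Llama 3.1 8B checkpoint id>"))
# print(f"speedup: {dense_s / hybrid_s:.2f}x")
```

Taking the median over a few runs after a warm-up call reduces noise from one-off startup costs. If 65K tokens does not fit in memory, lowering `prompt_tokens` and sweeping several lengths still shows the relative scaling the exercise asks about.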