Hybrid Zoo — Griffin, Hymba, Bamba, Zamba2, SAMBA

Independent convergence

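Once Jamba shipped, several teams independently converged on hybrid architectures, each with its own twist. The diversity of approaches is itself a signal: there is no single "right" hybrid, only many plausible designs that all work. A tour of the most important ones as of 2026: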

Griffin / RecurrentGemma (Google, 2024)

Combines the real-gated linear recurrent unit (RG-LRU) with sliding-window attention. Reported throughput is roughly 6K tok/s, versus about 2K for Gemma at a comparable scale. RecurrentGemma is the open-weights release built on this architecture, which is one of Google's productionized alternatives. The recurrent component here is closer in flavor to RWKV than to Mamba.
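
To make the recurrent half concrete, here is a minimal scalar sketch of a gated linear recurrence in the style of RG-LRU, written from memory of the Griffin paper's description. The weights `w_r` and `w_i`, the parameter `lam`, and the constant `c = 8` are illustrative assumptions, and the real layer works on vectors with learned projections. The only point is that the state `h` stays a fixed size no matter how long the sequence gets.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def rg_lru_scalar(xs, w_r=1.0, w_i=1.0, lam=2.0, c=8.0):
    """Toy scalar gated linear recurrence (all parameters are hypothetical)."""
    a = sigmoid(lam)          # base decay, strictly between 0 and 1
    h = 0.0                   # the entire recurrent state: one number
    outputs = []
    for x in xs:
        r = sigmoid(w_r * x)  # recurrence gate: controls how fast the state decays
        i = sigmoid(w_i * x)  # input gate: controls how much of x is written
        a_t = a ** (c * r)    # input-dependent decay, still in (0, 1)
        # Decay the old state, then add the gated input with a scale that
        # keeps the state magnitude bounded.
        h = a_t * h + math.sqrt(1.0 - a_t * a_t) * (i * x)
        outputs.append(h)
    return outputs


ys = rg_lru_scalar([1.0] * 10)
```

Unlike attention, nothing here grows with sequence length: each step reads one input and updates one state value.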

Hymba (NVIDIA, 2024)

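The most architecturally novel hybrid. Instead of alternating SSM and attention layers, Hymba runs both in parallel inside the same layer: attention heads and Mamba heads process the same input side by side, and their outputs are fused. The reported cache reduction is about 11.67×. The parallel-fusion approach is interesting because every layer gets the benefit of both operations, rather than getting them in sequence across layers.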
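
The difference between parallel fusion and alternation can be sketched with two toy heads over a sequence of scalars. This is a structural illustration only: the heads, the fusion weights `beta_attn` and `beta_ssm`, and the decay value are hypothetical stand-ins, and Hymba's actual fusion is not reproduced here.

```python
import math


def causal_attention_head(xs):
    """Toy causal attention: softmax-weighted average over current and past inputs."""
    out = []
    for t in range(len(xs)):
        scores = [xs[t] * xs[s] for s in range(t + 1)]
        m = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(sc - m) for sc in scores]
        z = sum(weights)
        out.append(sum(w * xs[s] for s, w in enumerate(weights)) / z)
    return out


def recurrent_head(xs, decay=0.9):
    """Toy linear recurrence standing in for an SSM head (fixed-size state)."""
    h, out = 0.0, []
    for x in xs:
        h = decay * h + (1.0 - decay) * x
        out.append(h)
    return out


def parallel_hybrid_layer(xs, beta_attn=0.5, beta_ssm=0.5):
    """Both heads read the SAME input; their outputs are fused per position."""
    attn_out = causal_attention_head(xs)
    ssm_out = recurrent_head(xs)
    return [beta_attn * a + beta_ssm * s for a, s in zip(attn_out, ssm_out)]


def alternating_hybrid(xs):
    """For contrast: one operation feeds the next, as in layer alternation."""
    return recurrent_head(causal_attention_head(xs))


fused = parallel_hybrid_layer([1.0, 2.0, 3.0])
stacked = alternating_hybrid([1.0, 2.0, 3.0])
```

In the parallel version each output position mixes a precise lookup over past tokens with a compressed running summary; in the alternating version the recurrence only ever sees what attention already produced.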

Bamba (IBM, 2024)

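A 9B model with 29 Mamba2 layers and 3 attention layers, reporting 2.5× the throughput of a comparable Transformer. It is notable for day-0 vLLM support: integration with serving infrastructure was ready at release. That is the kind of detail that matters more for adoption than benchmark numbers.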
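
A back-of-the-envelope sketch shows why so few attention layers matter for serving memory. The 29 + 3 split comes from the text above; the attention layer positions, KV head count, head dimension, sequence length, and 2-byte precision are all assumed values chosen for illustration.

```python
def layer_schedule(n_layers, attention_positions):
    """Label each layer index as 'attention' or 'mamba2'."""
    return [
        "attention" if i in attention_positions else "mamba2"
        for i in range(n_layers)
    ]


def kv_cache_bytes(n_attention_layers, seq_len, n_kv_heads, head_dim,
                   bytes_per_value=2):
    """KV cache size: one K and one V tensor per attention layer."""
    return 2 * n_attention_layers * seq_len * n_kv_heads * head_dim * bytes_per_value


# Hypothetical placement of the 3 attention layers in a 32-layer stack.
schedule = layer_schedule(32, {9, 18, 27})

# Assumed shapes: 8 KV heads, head_dim 128, 8192-token context, 2-byte values.
hybrid = kv_cache_bytes(schedule.count("attention"), 8192, 8, 128)
dense = kv_cache_bytes(len(schedule), 8192, 8, 128)
ratio = dense / hybrid
print(f"KV cache: dense {dense / 2**20:.0f} MiB, hybrid {hybrid / 2**20:.0f} MiB, "
      f"ratio {ratio:.2f}x")
```

Under these assumptions the KV cache shrinks by a factor of 32/3. The Mamba2 layers do keep their own state, but it is fixed-size and is not counted in this sketch.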

Zamba2 (Zyphra, 2024)

A shared-attention-backbone design combined with Mamba2 blocks, reporting 4× faster generation. Zyphra's approach is interesting because the attention backbone is shared across the stack, which balances recall and efficiency differently from pure layer alternation.
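
The idea of sharing one attention block across depths can be sketched as a stack that references the same object several times. The block sizes, the `share_every` spacing, and the `Block` class itself are hypothetical; how many shared blocks the real model uses, and whether it adds per-depth adapters, is not captured here.

```python
class Block:
    """Placeholder for a layer; only tracks a name and a parameter count."""

    def __init__(self, name, n_params):
        self.name = name
        self.n_params = n_params


def build_shared_backbone_stack(n_mamba, share_every, mamba_params, attn_params):
    """Insert the SAME attention block after every `share_every` Mamba2 blocks."""
    shared_attention = Block("shared_attention", attn_params)
    stack = []
    for i in range(n_mamba):
        stack.append(Block(f"mamba2_{i}", mamba_params))
        if (i + 1) % share_every == 0:
            stack.append(shared_attention)  # same object, so same weights
    return stack


def unique_params(stack):
    """Count each distinct block once, however often it is applied."""
    return sum({id(b): b.n_params for b in stack}.values())


def applied_params(stack):
    """What the count would be if every application had its own weights."""
    return sum(b.n_params for b in stack)


stack = build_shared_backbone_stack(n_mamba=12, share_every=6,
                                    mamba_params=10, attn_params=30)
print(len(stack), unique_params(stack), applied_params(stack))
```

The stack applies attention twice but pays for its parameters once, which is the trade the shared-backbone design makes: attention's recall at several depths without a matching growth in weights.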

SAMBA

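Mamba and sliding-window attention in equal parts. SAMBA is interesting as a counter-data-point: its ratio is 1:1 rather than 1:7, and it works well in its specific configuration. It is a reminder that the optimal ratio depends on where you are in the design space: SAMBA's attention is the sliding-window kind, which is cheaper per layer than full attention, so the design can afford more attention layers.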
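
A small sketch of why a 1:1 ratio can be affordable when the attention is windowed. The 8-layer stack, the 2048-token window, and the 100,000-token context are assumed values for illustration, not figures from any of the papers.

```python
def sliding_window_mask(seq_len, window):
    """mask[t][s] is True when position t may attend to position s:
    causal, and at most `window` positions back (including itself)."""
    return [[0 <= t - s < window for s in range(seq_len)] for t in range(seq_len)]


def total_cached_positions(n_attention_layers, seq_len, window=None):
    """Number of token positions whose K/V must be kept, summed over layers.
    Full attention keeps every position; a sliding window keeps at most `window`."""
    per_layer = seq_len if window is None else min(seq_len, window)
    return n_attention_layers * per_layer


# Hypothetical 8-layer stacks at a 100,000-token context.
one_to_one_windowed = total_cached_positions(4, 100_000, window=2048)  # 1:1 ratio
one_to_seven_full = total_cached_positions(1, 100_000)                 # 1:7 ratio

mask = sliding_window_mask(5, 2)
print(one_to_one_windowed, one_to_seven_full)
```

Under these assumptions, four windowed attention layers cache fewer positions than a single full-attention layer, because the windowed cache stops growing once the context exceeds the window.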

Exercise

From this list (Jamba, Hymba, Bamba, Zamba2, Griffin, SAMBA), pick three hybrid architectures and read the architecture section of each paper. For each one, write a one-paragraph summary describing (a) what it hybridizes, (b) how the operations are combined (alternating vs. parallel vs. shared backbone), (c) what it claims about quality and efficiency, and (d) what its production status is. The point is not to pick a winner but to develop a felt sense of the design space.