Hybrid 가 이기는 이유

"둘 다" 의 산수

Pure Transformer 는 perfect recall 인데 quadratic cost. Pure SSM 은 linear cost 인데 lossy memory. 자연스러운 질문은 둘 다 가능한지 — 그리고 field 가 수렴한 경험적 답은: 네, 전략적으로 섞어서. SSM (또는 RWKV-class) layer 를 stack 의 bulk — 80–90% — 위해 사용, 모델이 정확한 recall 필요한 전략적 depth 에 sparse attention layer 배치, 나머지 위해.

산수가 작동 이유 설명. layer 의 90% 가 O(n) cost 이고 10% 가 O(n²) 이면, dominant cost 는 여전히 O(n²) 인데, constant factor 가 10× 작아. Long context (attention 의 quadratic cost 가 가장 중요) 에서, 이 constant-factor 감소가 실전에서 대략 order of magnitude — pure SSM 이 잃는 recall 능력 유지하면서.

몇 attention layer 면 충분?

경험적으로 발견된 답이 계속 약 SSM 7–9 layer 당 attention 1 layer 로 나와. NVIDIA 의 Mamba-2-Hybrid study 가 발견한 거 — 총 28 중 attention layer 4 개만 추가해도 in-context learning 능력 substantially 회복. 그 비율 미만, ICL collapse; 그 위로 많이, efficiency 대부분 포기. 1:7 에서 1:9 range 가 독립 팀들이 발견한 stable empirical sweet spot.

이게 진지하게 받아들일 가치 있는 종류의 경험적 수렴. 서로 안 얘기하던 여러 lab — AI21, NVIDIA, IBM, Zyphra — 이 비슷한 attention:SSM 비율에 독립적으로 도달. 그게 비율이 production scale 에서 recall-vs-efficiency tradeoff 에 대한 진짜 뭔가 반영한다는 강한 신호.

전략적 depth 질문

그냥 "몇 개" attention layer 가 아냐 — 어디. 대부분 production hybrid 가 attention layer 를 모델이 더 high-level integration 하는 middle-to-late depth 에 배치. 초기 layer (대부분 token-level 처리) 와 매우 마지막 layer (output projection) 는 pure SSM 으로 OK 한 경향. middle layer 가 recall 가장 중요한 곳, 거기서 attention 이 자기 자리 벌어.

이게 heuristic, theorem 아냐. 다른 production 디자인이 attention 을 약간 다른 곳에 둬. 처음부터 hybrid build 한다면, Jamba, Nemotron-H, Granite 4.0 의 depth 분포 study; 그것들이 cluster 하지만 정확히 동의하진 않는 거 발견.

"둘 다" 의 산수

몇 attention layer 면 충분?

전략적 depth 질문

External links

Exercise

Progress

댓글 0