Hardware Advantage: GEMM, Tensor Cores, and Why Scan Is Hard

GPUs are matmul machines

Modern GPUs (H100, B200, and beyond) are designed first and foremost as dense matrix-multiply engines. Large matmul operations saturate the tensor cores efficiently. The entire memory hierarchy (HBM → L2 → SRAM) is tuned to the access patterns that matmul produces.
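One way to see why the memory hierarchy favors matmul is to compare arithmetic intensity (FLOPs per byte moved) for a square matmul and for one step of an elementwise linear recurrence. The sketch below is a back-of-the-envelope roofline estimate, not a measurement. The function names are hypothetical, and the H100 figures (about 989 TFLOP/s dense BF16, about 3.35 TB/s HBM bandwidth) are approximate values assumed for illustration. It also assumes each operand crosses HBM exactly once.

```python
def gemm_intensity(n: int, bytes_per_elem: int = 2) -> float:
    """FLOPs per byte for an n x n x n matmul, assuming A, B, C each cross HBM once."""
    flops = 2 * n**3                           # one multiply and one add per inner-product term
    bytes_moved = 3 * n * n * bytes_per_elem   # read A, read B, write C
    return flops / bytes_moved


def recurrence_step_intensity(bytes_per_elem: int = 2) -> float:
    """FLOPs per byte for one step of h_t = a_t * h_{t-1} + b_t, with h_{t-1} kept on chip."""
    flops = 2                                  # one multiply, one add
    bytes_moved = 3 * bytes_per_elem           # read a_t, read b_t, write h_t
    return flops / bytes_moved


# ASSUMPTION: approximate H100 SXM figures, used only to place the ridge point.
PEAK_FLOPS = 989e12       # dense BF16 tensor-core throughput, FLOP/s
PEAK_BANDWIDTH = 3.35e12  # HBM bandwidth, bytes/s

# Kernels below this intensity are limited by memory bandwidth, not compute.
ridge_point = PEAK_FLOPS / PEAK_BANDWIDTH

if __name__ == "__main__":
    print(f"ridge point:      {ridge_point:8.1f} FLOPs/byte")
    print(f"4096^3 matmul:    {gemm_intensity(4096):8.1f} FLOPs/byte")
    print(f"recurrence step:  {recurrence_step_intensity():8.3f} FLOPs/byte")
```

Under these assumptions the large matmul sits well above the ridge point (compute-bound) and the naive recurrence sits far below it (memory-bound). That is why scan kernels work so hard to keep state in SRAM.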

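Standard attention is essentially two matmuls wrapped around a softmax: Q·Kᵀ, softmax (a non-matmul interlude), then weights·V. FlashAttention-3 fuses these into a tile-based pattern that maps cleanly onto Hopper hardware. The result is up to about 75% of H100 theoretical peak in FP16, close to what the chip can realistically deliver.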
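To make the "two matmuls around a softmax" structure concrete, here is a minimal, unfused reference in plain Python. It is a sketch for reading, not how FlashAttention-3 is implemented: there is no tiling or fusion, and the helper names are hypothetical. The 1/sqrt(d) scaling is the usual convention for scaled dot-product attention.

```python
import math


def matmul(a, b):
    """Plain matrix product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def softmax_rows(m):
    """Row-wise softmax; subtracting the row max keeps exp() numerically stable."""
    out = []
    for row in m:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def attention(q, k, v):
    d = len(q[0])
    scores = matmul(q, transpose(k))                          # matmul 1: Q·Kᵀ
    scores = [[s / math.sqrt(d) for s in row] for row in scores]
    weights = softmax_rows(scores)                            # the non-matmul interlude
    return matmul(weights, v)                                 # matmul 2: weights·V
```

A fused kernel computes the same result but never writes the full `scores` matrix to HBM. It processes tiles of Q, K, and V that fit in SRAM.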

Parallel scan is structurally different

SSMs need a parallel scan algorithm to get training-time parallelism. A scan is structurally different from a matmul: it carries dependencies between sequential elements, and parallelizing it takes clever algorithmic tricks (Blelloch scan, Brent-Kung adder patterns). The tricks work, but the result does not map onto tensor cores as cleanly as a matmul does. Tiling a scan into GPU SRAM is harder, and the constant factors are worse.
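The trick behind these algorithms is that a linear recurrence h_t = a_t·h_{t-1} + b_t can be rewritten as a composition of affine maps, and composition is associative. The sketch below illustrates this with a simple Hillis-Steele-style scan. It is used here for brevity and is not the work-efficient Blelloch variant named above. It is also pure Python that simulates the parallel passes with list comprehensions, so it shows the dependency structure rather than real GPU performance. All function names are hypothetical.

```python
def combine(left, right):
    """Compose two steps of the form h -> a*h + b; `left` is applied first."""
    a1, b1 = left
    a2, b2 = right
    return (a1 * a2, a2 * b1 + b2)


def sequential_scan(a, b, h0=0.0):
    """Reference: h_t = a_t * h_{t-1} + b_t, one step at a time."""
    h, out = h0, []
    for a_t, b_t in zip(a, b):
        h = a_t * h + b_t
        out.append(h)
    return out


def parallel_style_scan(a, b, h0=0.0):
    """Inclusive scan over (a_t, b_t) pairs in about log2(n) passes."""
    elems = list(zip(a, b))
    n = len(elems)
    stride = 1
    while stride < n:
        # Every position in a pass reads only the previous pass's values,
        # so on a GPU all n updates could run at the same time.
        elems = [
            combine(elems[i - stride], elems[i]) if i >= stride else elems[i]
            for i in range(n)
        ]
        stride *= 2
    # elems[t] now holds the composed map for steps 0..t; apply it to h0.
    return [a_t * h0 + b_t for a_t, b_t in elems]
```

Note what the passes consist of: small elementwise multiply-adds with strided reads, not large dense tiles. That is the shape mismatch with tensor cores described above.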

The TFLA (Tiled Flash Linear Attention) kernel (NeurIPS 2025) is a major breakthrough here: on H100 it finally matches FlashAttention-3 speed for linear RNNs. But TFLA took years of dedicated kernel engineering to write, and nothing yet compares to FlashAttention's cross-architecture maturity. Every new scan-based architecture has to fight for a comparable kernel of its own.

Hardware-software co-design is just getting started

Custom ASICs are starting to target SSM-friendly designs. Google Ironwood (the seventh-generation TPU) and AWS Trainium3 have features that better support parallel scan and recurrent compute. Cerebras WSE and Groq LPU have always had different sweet spots from NVIDIA GPUs, and in some respects they are more SSM-friendly.

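But the GPU ecosystem's head start is measured in years and billions of dollars of optimization. Hardware-software co-design for the alternatives is at an early stage: promising, but not yet decisive. NVIDIA's own commitment to hybrid SSM-attention models through Nemotron is the strongest single signal that the GPU side will eventually get first-class kernel support for hybrid stacks.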

Exercise

On the same single GPU, profile Mamba-130M and a Transformer of similar parameter count using torch.profiler or nsys. Look at the kernel trace: what fraction of the time goes to compute versus memory access? You should see the Mamba kernels spending relatively more of their time on memory access than the FlashAttention calls do. That is the scan-vs-matmul gap, visible in the profiler output.