Modern GEMM Architecture

3단 tile 피라미드 — 같은 모양, 두 생태계

CUDA랑 Metal에서 state-of-the-art GEMM이 같은 3단 tiling 구조로 수렴. Outer tile은 shared / threadgroup memory에 fit. Middle tile은 register에 fit. Inner tile은 matrix unit의 native 모양. 차이는 spell이랑 정확한 dimension이지 개념 아냐.

CUDA (Ada Lovelace / Hopper)

┌── 128×128  thread-block tile (CTA) ───┐
│   ┌─ 64×64  warp tile ─┐               │
│   │ 16×8×16 tensor-core │  ·         │  ← register + tensor core
│   │ 16×8×16 tensor-core │  ·         │
│   └────────────────────┘  ·         │
└──────────────────────────────────────┘ ← shared memory (double-buffered)

Metal (Apple M-series)

┌── 128×128  threadgroup tile ───┐
│   ┌─ 64×64  SIMD-group tile ─┐  │
│   │ 8×8 matrix unit (FP16)   │··│  ← register + matrix coprocessor
│   └─────────────────────────┘··│
└────────────────────────────────┘ ← threadgroup memory (double-buffered)

Takeaway 셋:

같은 피라미드, 다른 micro-tile size. NVIDIA Tensor Core는 instruction당 16×8×16 (m×n×k) FP16 tile에서 동작. Apple matrix unit은 8×8 FP16. 다른 vendor IP, 동일한 role.
smem layer에서 double-buffering 필수. Warp가 tile i에 compute하는 동안 tile i+1의 next-tile load가 병렬로 일어남. 없으면 matrix unit stall.
Tile size는 shape별 auto-tune. 128×128이 전형이지만 tall-skinny GEMM (m=4, n=4096, k=4096 — single-token decode)엔 라이브러리가 32×64 + split-K 픽 가능.

Code

Tensor Core micro-kernel (CUDA WMMA API) — 각 instruction이 하는 일·cuda

#include <mma.h>
using namespace nvcuda::wmma;

// 텐서코어 instruction 한 개: 16×16 = 16×16 + 16×16 (FP16 in, FP32 accum)
fragment<matrix_a, 16, 16, 16, half, row_major> a_frag;
fragment<matrix_b, 16, 16, 16, half, col_major> b_frag;
fragment<accumulator, 16, 16, 16, float> c_frag;

fill_fragment(c_frag, 0.0f);

// 커널의 K loop 안:
load_matrix_sync(a_frag, smemA + offset, lda);
load_matrix_sync(b_frag, smemB + offset, ldb);
mma_sync(c_frag, a_frag, b_frag, c_frag);   // 16*16*16*2 = 8192 FLOPs / instruction

// CUTLASS가 이거 다 wrap; WMMA 손으로 거의 안 씀.

Apple matrix coprocessor via SIMD-group op (Metal)·metal

#include <metal_stdlib>
#include <metal_simdgroup_matrix>
using namespace metal;

// Apple의 Tensor Core equivalent: simdgroup_matrix.
// 곱당 8×8 half-precision tile.
kernel void gemm_simd(
    const device half *A [[buffer(0)]],
    const device half *B [[buffer(1)]],
    device       half *C [[buffer(2)]],
    uint2 tg_id [[threadgroup_position_in_grid]],
    uint  simd_lane [[thread_index_in_simdgroup]])
{
    simdgroup_matrix<half, 8, 8> A_frag, B_frag, C_frag(0);

    // 8×8 tile load, multiply-accumulate; 라이브러리 코드가 wrap.
    simdgroup_load(A_frag, A + offset, lda);
    simdgroup_load(B_frag, B + offset, ldb);
    simdgroup_multiply_accumulate(C_frag, A_frag, B_frag, C_frag);

    simdgroup_store(C_frag, C + offset, ldc);
}

Exercise

이 중 하나의 50–100 줄 read: (a) CUTLASS의 cutlass/gemm/kernel/gemm.h, (b) MLX의 mlx/backend/metal/kernels/gemm.metal. 모든 줄 이해하려고 하지 말고 — 3단 tile 구조 (block tile / warp tile / micro-tile)랑 double-buffered shared-memory load만 짚어. 보이면 ship된 코드에서 architecture 교훈 구체화된 거.