라이브러리 비밀 한눈에

Vendor BLAS를 peak 90%로 올리는 다섯 기둥

손코딩 tiled GEMM은 peak ~5%. cuBLAS는 80%+. 격차가 수십 년 누적된 엔지니어링 다섯 기둥 위에 세워져. 체크리스트로 읽어; 어느 기둥이든 깨지면 vendor 성능도 무너짐.

기둥	하는 일	관찰 위치
Blocking / tiling	Hot 데이터를 shared / L2에 유지 — Track 6의 tiled GEMM이랑 같은 아이디어, 단 multi-level	Nsight 'shared mem reads', Xcode 'L2 hit ratio'
Micro-kernel (Tensor / matrix unit)	Instruction당 32–128 FLOP하는 fused 행렬곱 op (FMA당 아님)	SASS `mma.*` instruction, Apple Instruments matrix-unit counter
Split-K & Streaming-K	K 차원을 block 가로질러 병렬화, 그 다음 partial sum reduce	`cublasLtMatmulAlgoGetHeuristic`, MLX의 split-K path
Heuristic algo picker	특정 shape용 tile size / epilogue / splitK 자동 검색	Nsight 'Algorithm ID' 패널, cuBLASLt 로그
Epilogue fusion	커널의 final loop에 bias add + activation + residual add fold	cuBLASLt `activationType`, `β·C` 처리

표 한 번 읽어. 그 다음 주목: 손코딩 tiled GEMM은 첫 번째 기둥만 가짐. 5%인 이유. 나머지 기둥 각각이 독립적으로 ~2× 가치 — 곱하면 격차 이해돼.

Production에서 각 기둥 깨는 거

작은 m이나 k — micro-kernel이 기대 모양 못 채움; 라이브러리가 GEMV-grade 커널로 fallback.
Hidden size가 16 배수 아님 — Tensor Core path는 이거 요구; off-multiple은 더 느린 fallback 강제.
L2 thrashing — concurrent 커널이 tile evict; 명백한 이유 없이 성능 절반.
Shared memory bank conflict — tile loader stall; profiler가 'shared bank conflict' 높게 보여줌.

Code

cuBLASLt heuristic 출력 읽기 — 어느 algo 픽됐나?·cuda

// cuBLASLt이 algo picker 보이게 하는 길.
// (Pseudocode — 진짜 API는 장황; docs 참고.)

cublasLtMatmulHeuristicResult_t heuristic[10];
int returnedResults = 0;
cublasLtMatmulAlgoGetHeuristic(
    ltHandle,                 // lib handle
    matmulDesc,               // operation 설명
    Adesc, Bdesc, Cdesc, Cdesc,
    preference,
    /*requestedAlgoCount=*/ 10,
    heuristic,
    &returnedResults);

for (int i = 0; i < returnedResults; ++i) {
    printf("Algo %d: workspace=%zu, waves=%f\n",
           i, heuristic[i].workspaceSize, heuristic[i].wavesCount);
}
// 첫 entry가 특정 (m, n, k, dtype)에 대한 라이브러리의 best pick.

Force vs auto — 라이브러리 픽 override 시점·cuda

// Heuristic이 가끔 틀려 (특히 unusual shape).
// 사용 가능한 algo 스윕해서 진짜 winner 픽 가능:

// 1. 이 op에 지원되는 algorithm ID 다 가져옴
int algos[32]; int n_algos;
cublasLtMatmulAlgoGetIds(handle, ..., 32, algos, &n_algos);

// 2. 각각을 warmup run 몇 번에 걸쳐 timing
float best_ms = INFINITY; int best = -1;
for (int i = 0; i < n_algos; ++i) {
    float ms = bench_algo(algos[i], A, B, C, m, n, k);
    if (ms < best_ms) { best_ms = ms; best = i; }
}

// 3. Winner 캐시 — algo 선택은 shape-specific
shape_to_algo[std::make_tuple(m, n, k)] = algos[best];

Exercise

CUDA 박스에서 4096³ FP16 GEMM을 default 설정으로 cuBLAS 한 번 돌리고, 어느 Tensor Core algo 픽됐는지 탐색 (Nsight Compute 사용 또는 cuBLASLt 로그 read). 그 다음 일부러 non-Tensor-Core algo 써서 다시 돌려. 관찰되는 2–3× 격차가 기둥 2 (micro-kernel) 단독 가치 — 그리고 라이브러리가 자동 선택하는 건 기둥 4 (heuristic picker) 덕분.