Naive GEMM vs Tiled GEMM

같은 곱셈, 두 성능 세계

GEMM (C = A·B)은 '그냥 thread 던져' 가 안 통하는 지점. naive 버전은 thread 하나를 출력 원소 하나에 할당하고 K 차원 loop. tiled 버전은 block 안에서 협력해 A랑 B 청크를 shared memory로 load하고 다음 DRAM trip 전에 각 청크를 여러 번 재사용. 같은 답, 극적으로 다른 속도.

Naive GEMM — 고통을 느껴

각 thread가 C[m][n] 하나 계산:

for k in 0..K: C[m][n] += A[m][k] * B[k][n]

모든 FMA가 global 메모리 load 두 번 요구 (A 한 번, B 한 번). 재사용 없음, shared memory 없음. 커널이 bandwidth-bound, peak FLOPs의 ~5%로 돔.

Tiled GEMM — hot 데이터를 shared memory에

출력 tile당 block의 thread들이 협력해서 A-tile + B-tile을 shared memory에 load. 각 thread가 shared memory에서 read해서 자기 partial 출력 계산 — global load 한 번에 cheap read 여러 번. Inner loop:

다음 A-tile이랑 B-tile load (global → shared).
__syncthreads() — 모든 thread가 load된 데이터 보도록.
Register에 multiply-accumulate: Csub += As[ty][k] * Bs[k][tx].
다음 K-slice로; 반복.

RTX 4090 숫자 (FP32, non-Tensor-Core path):

커널	시간 (ms)	TFLOP/s	비고
gemm-naive (512³)	0.65	0.41	FMA당 DRAM read 한 번
gemm-tiled (2048³)	4.50	3.8	Shared-mem 재사용, register blocking
cuBLAS sgemm	~3.9	~7.0	Multi-stage tiling, cublasGemmEx로 Tensor Core

Naive→tiled 9× 가속은 순전히 데이터 재사용. cuBLAS는 Tensor Core랑 multi-stage software pipelining (Track 8)으로 또 ~2× 추가.

Code

Naive GEMM (CUDA) — 출력 원소당 thread 하나·cuda

// 출력 원소당 thread 하나. FMA당 global load 둘. Bandwidth-bound.
__global__ void gemm_naive(const float* A, const float* B, float* C,
                           int M, int N, int K) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= M || col >= N) return;

    float c = 0.f;
    for (int k = 0; k < K; ++k) {
        c += A[row * K + k] * B[k * N + col];   // load 둘, FMA 하나
    }
    C[row * N + col] = c;
}

Tiled GEMM (CUDA) — block 협력 tile load·cuda

// 32×32 tile. 각 block이 C의 32×32 청크 하나 계산.
// Block 안에서 thread들이 협력해 A, B tile을 shared memory에 load,
// sync, 그 다음 shared에서 multiply-accumulate.
#define TILE 32

__global__ void gemm_tiled(const float* A, const float* B, float* C,
                           int M, int N, int K) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int tx = threadIdx.x, ty = threadIdx.y;
    int row = blockIdx.y * TILE + ty;
    int col = blockIdx.x * TILE + tx;

    float c = 0.f;
    for (int kt = 0; kt < (K + TILE - 1) / TILE; ++kt) {
        int a_col = kt * TILE + tx;
        int b_row = kt * TILE + ty;
        As[ty][tx] = (row < M && a_col < K) ? A[row * K + a_col] : 0.f;
        Bs[ty][tx] = (b_row < K && col < N) ? B[b_row * N + col] : 0.f;
        __syncthreads();

        #pragma unroll
        for (int k = 0; k < TILE; ++k) {
            c += As[ty][k] * Bs[k][tx];          // 둘 다 smem read
        }
        __syncthreads();
    }
    if (row < M && col < N) C[row * N + col] = c;
}

// Launch:
//   dim3 block(TILE, TILE);
//   dim3 grid((N+TILE-1)/TILE, (M+TILE-1)/TILE);
//   gemm_tiled<<<grid, block>>>(dA, dB, dC, M, N, K);

Exercise

쓸 수 있는 GPU에서 1024×1024 FP32 multiply용 naive랑 tiled GEMM (32×32 tile) 둘 다 구현. 둘 다 numpy.dot이랑 검증. timing — naive에서 tiled로 대략 5–10× 가속 보여야 함. 그 다음 2048×2048로 늘려 — 큰 문제가 naive 버전을 더 bandwidth-starved로 만드니까 격차 보통 더 벌어짐.