행렬 layout & coalesced access

벡터로 가득한 책장을 그려봐

행렬은 벡터의 stack. 각 row (또는 column)을 자기 벡터로 슬라이스 가능, 근데 책장 전체는 contiguous 메모리 청크 하나에 들어가야 함. 어느 차원이 contiguous냐가 layout 선택.

GPU mantra: coalesce 아니면 죽음. Warp의 32 인접 thread가 32 contiguous 주소 read하면 하드웨어가 한 번 (또는 한 번 가까이)의 메모리 transaction에 처리. 그 패턴 깨면 bandwidth 폭락.

Layout	메모리 순서	적합한 경우
Row-major	Row 원소 인접 (column 따라 stride 1)	Thread가 row 가로질러 stride
Column-major	Column 원소 인접 (row 따라 stride 1)	Thread가 column 따라 내려감 (Fortran/cuBLAS)

RTX 4090, 4096×4096 FP32 row-sum reality check:

Row-major + thread가 row 스캔 (coalesced): 610 GB/s
Column-major + thread가 row 스캔 (strided): 105 GB/s

전적으로 access 패턴이 만든 6× 격차.

Leading dimension (ld)

'Leading dimension'은 연속 row (또는 column) 사이 stride. 빈틈없이 packed 행렬은 row-major면 ld == cols, column-major면 ld == rows. 근데 메모리 bank conflict 피하려고 padding 자주 — ld > cols. 모든 cuBLAS 호출이 lda, ldb, ldc 받는 이유 정확히 이거.

Code

Row-major 행렬 sum (coalesced) — 빠름·cuda

__global__ void row_sum(const float* M, float* out, int rows, int cols, int ld) {
    int row = blockIdx.x;                  // row당 block 하나
    int tid = threadIdx.x;
    if (row >= rows) return;

    // Warp의 thread들이 연속한 M[row][0..cols] 원소 read
    // — 완벽 coalesced. 각 thread가 partial sum 누적,
    // 그 다음 shared memory에서 reduce (다음 레슨).
    float s = 0.f;
    for (int c = tid; c < cols; c += blockDim.x) {
        s += M[row * ld + c];
    }
    // ... block의 thread 가로질러 reduction (lesson 2) ...
}

Row 단위로 read하는 column-major 행렬 sum (strided) — 느림·cuda

// 같은 논리적 op이지만 column-major 저장:
//   M[row, col] = M_buf[col * ld + row]
// Warp의 thread들이 row in 0..warp_size에 대해 M_buf[col*ld+row]
// read — step마다 `ld`만큼 stride → uncoalesced.
__global__ void row_sum_colmajor(const float* M_buf, float* out,
                                  int rows, int cols, int ld) {
    int row = blockIdx.x;
    int tid = threadIdx.x;
    if (row >= rows) return;
    float s = 0.f;
    for (int c = tid; c < cols; c += blockDim.x) {
        s += M_buf[c * ld + row];   // 각 access가 `ld` stride
    }
    // 같은 수학, bandwidth-bound 커널에서 ~6× 느림.
}

Exercise

쓸 수 있는 GPU에서 4096×4096 row-major 행렬 잡고 커널 둘 timing: (a) row당 block 하나, thread가 column 따라 stride해서 row 합산; (b) column당 block 하나, thread가 row-major buffer의 row 내려가며 column 합산. 차이 (5–10× 자주)가 순전히 access 패턴 — 수학 아냐. (b) fix는 행렬 먼저 transpose하거나, transpose 후 각 block이 자기 column contiguous 스캔하게 재구성.