Memory layout & roofline

Same matrix, two layouts, a 6× bandwidth gap

A 2-D matrix doesn't actually live in memory as 2-D: RAM is one big 1-D array. How you choose to flatten it decides which threads can read it efficiently.

Row-major (C order): addresses march along a row, so M[i][j] and M[i][j+1] are adjacent in memory. Default in C/C++, Python/NumPy, PyTorch, MLX, and Metal.
Column-major (Fortran order): addresses run down a column, so M[i][j] and M[i+1][j] are adjacent. Default in Fortran, MATLAB, R, and (because BLAS was specified in 1979, in Fortran) cuBLAS.
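To make the two flattening rules concrete, here is a minimal sketch of the offset arithmetic, checked against NumPy's own flattening. The helper names row_major_offset and col_major_offset are made up for this illustration; they are not NumPy functions.

```python
import numpy as np

def row_major_offset(i, j, n_rows, n_cols):
    """Flat element offset of M[i][j] when rows are stored back to back."""
    return i * n_cols + j

def col_major_offset(i, j, n_rows, n_cols):
    """Flat element offset of M[i][j] when columns are stored back to back."""
    return j * n_rows + i

n_rows, n_cols = 4, 5
M = np.arange(n_rows * n_cols).reshape(n_rows, n_cols)

flat_c = M.ravel(order="C")  # row-major flattening
flat_f = M.ravel(order="F")  # column-major flattening

i, j = 2, 3
# Both formulas land on the same logical element M[i, j].
assert flat_c[row_major_offset(i, j, n_rows, n_cols)] == M[i, j]
assert flat_f[col_major_offset(i, j, n_rows, n_cols)] == M[i, j]

# Memory neighbours: one step in j for row-major, one step in i for column-major.
step_c = row_major_offset(i, j + 1, n_rows, n_cols) - row_major_offset(i, j, n_rows, n_cols)
step_f = col_major_offset(i + 1, j, n_rows, n_cols) - col_major_offset(i, j, n_rows, n_cols)
print(step_c, step_f)  # 1 1
```

The only difference between the layouts is which index gets multiplied by a dimension, and that is exactly what decides which neighbours are cheap to read together.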

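Why it matters: a GPU reads memory in coalesced chunks. When the 32 threads of a warp request 32 contiguous addresses, the hardware fetches them in a single transaction. When those threads stride across non-contiguous addresses, the hardware needs several transactions to satisfy the same request, and effective bandwidth collapses.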
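A toy model can show how stride turns one transaction into many. The sketch below assumes 4-byte elements and 128-byte aligned segments, with one transaction per distinct segment touched; the function name and the segment size are assumptions for illustration, and real hardware (sectors, caches) is more forgiving than this worst case.

```python
def transactions_per_warp(stride_elems, elem_bytes=4, warp_size=32, segment_bytes=128):
    """Count distinct aligned segments touched by one warp-wide load.

    Simplified model (an assumption, not a hardware spec): thread t reads the
    element at index t * stride_elems, and every distinct segment_bytes-aligned
    segment costs one memory transaction.
    """
    segments = {
        (t * stride_elems * elem_bytes) // segment_bytes
        for t in range(warp_size)
    }
    return len(segments)

print(transactions_per_warp(1))     # contiguous: 32 * 4 B = 128 B, so 1 transaction
print(transactions_per_warp(2))     # every other element: 2 transactions
print(transactions_per_warp(4096))  # one full row apart: 32 transactions
```

In this model a stride of one 4096-element row costs 32 transactions for the same 128 useful bytes, which is why strided access wastes so much of the bus.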

Concrete numbers, summing a 4096×4096 FP32 matrix on an RTX 4090:

Row-major, threads scan along rows (coalesced): 610 GB/s
Column-major, threads scan along rows (strided): 105 GB/s

That is a roughly 6× gap created entirely by the memory access pattern.
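To put those bandwidth figures in terms of time, the sketch below converts them into how long one pass over the matrix would take. It assumes the sum reads each of the 4096×4096 FP32 elements exactly once and nothing else; the helper name pass_time_ms is made up for this illustration.

```python
def pass_time_ms(n_bytes, bandwidth_gb_per_s):
    """Time in milliseconds to stream n_bytes at the given effective bandwidth."""
    return n_bytes / (bandwidth_gb_per_s * 1e9) * 1e3

n_bytes = 4096 * 4096 * 4  # FP32 matrix: 67,108,864 bytes (64 MiB)

coalesced_ms = pass_time_ms(n_bytes, 610)  # about 0.110 ms
strided_ms = pass_time_ms(n_bytes, 105)    # about 0.639 ms
ratio = strided_ms / coalesced_ms          # about 5.8, the "6x" gap

print(f"coalesced: {coalesced_ms:.3f} ms, strided: {strided_ms:.3f} ms, ratio: {ratio:.1f}x")
```

The same formula run backwards, bytes moved divided by measured time, is how an effective bandwidth number is obtained from a kernel timing.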

Roofline: merging bandwidth and compute into one ceiling

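The roofline model says attainable performance is bounded by

min(peak FLOP/s, peak bandwidth × arithmetic intensity),

where arithmetic intensity is the kernel's FLOPs per byte of memory traffic.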

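Plot kernel intensity on the x-axis and achieved FLOP/s on the y-axis. Every kernel sits under either the slanted bandwidth ceiling (memory-bound, low intensity) or the flat compute ceiling (compute-bound, high intensity). Which side of the roofline knee you're on tells you which optimizations to chase.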
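Arithmetic intensity is just FLOPs divided by bytes moved, so it can be estimated by hand before plotting anything. The sketch below does that for AXPY and a square GEMM under idealized traffic assumptions (each operand read once, each result written once, no cache effects); the function names are made up for this illustration.

```python
def axpy_intensity(elem_bytes):
    """y <- a*x + y: 2 FLOPs per element; traffic is read x, read y, write y."""
    flops = 2
    bytes_moved = 3 * elem_bytes
    return flops / bytes_moved

def gemm_intensity(n, elem_bytes):
    """C = A @ B for n x n matrices: 2*n^3 FLOPs; ideal traffic reads A and B
    once and writes C once (an assumption that ignores tiling and caches)."""
    flops = 2 * n**3
    bytes_moved = 3 * n * n * elem_bytes
    return flops / bytes_moved

# Element size matters: the same kernel has a different intensity per dtype.
print(f"AXPY FP32:      {axpy_intensity(4):.3f} FLOPs/byte")        # 0.167
print(f"AXPY FP64:      {axpy_intensity(8):.3f} FLOPs/byte")        # 0.083
print(f"GEMM 4096 FP32: {gemm_intensity(4096, 4):.1f} FLOPs/byte")  # 682.7
print(f"GEMM 4096 FP16: {gemm_intensity(4096, 2):.1f} FLOPs/byte")  # 1365.3
```

AXPY intensity is constant no matter how large the vectors get, while GEMM intensity grows linearly with n, which is why the two kernels land on opposite sides of the knee.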

Code

Row-major vs column-major in NumPy: same data, different strides·python

import numpy as np

# Row-major (default)
A = np.zeros((4, 5), order='C')
print(A.strides)   # (40, 8): 40 bytes between rows, 8 between columns

# Column-major
B = np.zeros((4, 5), order='F')
print(B.strides)   # (8, 32): 8 bytes between rows, 32 between columns

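# Summing along a row of a row-major matrix: the inner loop walks
# contiguous memory (good).
# Summing down a column of a row-major matrix: the inner loop jumps
# one row stride (40 bytes here) per step (bad).
# On a GPU the gap is amplified because a warp coalesces 32 threads wide.

# A.sum(axis=1) on row-major: contiguous inner loop, fast
# A.sum(axis=0) on row-major: strided inner loop in a naive implementation, slow
# (Reversed for column-major)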
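The "same data, different strides" idea can be checked directly: a transpose in NumPy is only a stride swap over the same buffer, while np.asfortranarray produces a real column-major copy. A small sketch, using only standard NumPy calls:

```python
import numpy as np

A = np.arange(20, dtype=np.float64).reshape(4, 5)  # row-major by default
At = A.T                                           # a view: no data is copied

print(A.strides)                   # (40, 8)
print(At.strides)                  # (8, 40): same buffer, strides swapped
print(np.shares_memory(A, At))     # True
print(At.flags['F_CONTIGUOUS'])    # True: the transpose looks column-major

B = np.asfortranarray(A)           # an actual column-major copy, same shape as A
print(B.strides)                   # (8, 32)
print(np.shares_memory(A, B))      # False: the bytes were rearranged into a new buffer
print(np.array_equal(A, B))        # True: same logical values
```

So layout is a property of how indices map to addresses, not of the values themselves; A and B compare equal even though their bytes are ordered differently.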

Roofline arithmetic: which ceiling are you under?·python

# RTX 4090 specs (FP32, non-Tensor-Core)
PEAK_FLOPS_FP32 = 82e12       # 82 TFLOP/s
PEAK_BW         = 1.0e12      # 1.0 TB/s GDDR6X

def roofline(intensity):
    return min(PEAK_FLOPS_FP32, PEAK_BW * intensity)

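# Small FP32 AXPY: 2 FLOPs per 12 bytes (read x, read y, write y) ≈ 0.17 FLOPs/byte
print(roofline(2 / 12))    # ~167 GFLOP/s, bandwidth-bound

# 4096³ FP32 GEMM: 2·4096³ FLOPs over 3·4096²·4 bytes ≈ 683 FLOPs/byte
print(roofline(683))       # capped at 82 TFLOP/s, compute-bound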

# Roofline knee (bandwidth = compute):
knee = PEAK_FLOPS_FP32 / PEAK_BW
print(f'Knee at intensity = {knee:.1f} FLOPs/byte')   # ~ 82

Exercise

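In a Python REPL, time A.sum(axis=0) against A.sum(axis=1) on a 4096×4096 row-major NumPy matrix. Both do the same number of FLOPs, but the axis=1 version (along rows) walks contiguous memory, so you should see it run meaningfully faster. NumPy can reorder reduction loops internally, so the gap may be smaller than a naive loop would show. Then make a column-major copy with B = np.asfortranarray(A) and time again: the bias flips. It's the same effect that makes coalesced GPU access 6× faster than strided access.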
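One possible timing harness for this exercise is sketched below. The helper name time_reductions, the best-of-5 policy, and the choice of float32 are assumptions for illustration; absolute timings depend on your CPU, NumPy build, and cache sizes, so no expected numbers are given.

```python
import timeit
import numpy as np

def time_reductions(arr, repeats=5):
    """Best-of-N wall time in seconds for arr.sum over axis 0 and over axis 1."""
    return {
        axis: min(timeit.repeat(lambda axis=axis: arr.sum(axis=axis),
                                number=1, repeat=repeats))
        for axis in (0, 1)
    }

rng = np.random.default_rng(0)
A = rng.random((4096, 4096), dtype=np.float32)  # row-major, 64 MiB
B = np.asfortranarray(A)                        # column-major copy of the same values

results = {}
for name, arr in (("row-major", A), ("column-major", B)):
    results[name] = time_reductions(arr)
    t = results[name]
    print(f"{name:>12}: axis=0 {t[0] * 1e3:.2f} ms, axis=1 {t[1] * 1e3:.2f} ms")
```

Taking the minimum over several repeats filters out one-off noise such as page faults on first touch. If the two axes come out close, that is NumPy picking a cache-friendly loop order for you, not a sign that layout does not matter.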
