AXPY, GEMV, GEMM: Just Three Ops · GPU Compute Quest

BLAS Levels 1, 2, and 3, and which one earns its keep on a GPU

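Once you have vectors and matrices, BLAS sorts its routines into three levels, and one representative op per level covers what matters here: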

Op	Level	Signature	Kernel names	Typical use
AXPY	BLAS-1	`y ← α·x + y`	SAXPY, VectorAdd	Bias add, residual skip
GEMV	BLAS-2	`y ← α·A·x + β·y`	MatVec, DenseInference	Single-token autoregressive inference
GEMM	BLAS-3	`C ← α·A·B + β·C`	matmul, mma, wmma	Self-attention, batched inference, training

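Why does BLAS Level 3 matter so much more than Levels 1 and 2? Arithmetic intensity. AXPY does 2 FLOPs per element while moving 12 bytes in FP32 (read x, read y, write y), about 0.17 FLOPs/byte, so essentially every operation needs a fresh trip to DRAM. GEMV is not much better: each matrix element is loaded once and used for a single multiply-add, about 0.5 FLOPs/byte in FP32. GEMM, by contrast, reuses every loaded element O(N) times for N×N matrices, which gives it the kind of FLOPs/byte ratio GPUs are built to chew through.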
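The reuse argument can be made concrete with a blocked matrix multiply. The sketch below is only an illustration in NumPy, not how a GPU library implements GEMM: `tiled_gemm`, the `tile` parameter, and the `loaded` counter are hypothetical names, and "loaded" simply counts the elements of A and B pulled into a tile, standing in for traffic from slow memory to fast memory.

```python
import numpy as np

def tiled_gemm(A, B, tile=64):
    """Blocked C = A @ B. Also returns how many elements of A and B were loaded as tiles."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2, "inner dimensions must match"
    C = np.zeros((M, N), dtype=A.dtype)
    loaded = 0
    for i in range(0, M, tile):
        for j in range(0, N, tile):
            for k in range(0, K, tile):
                a = A[i:i + tile, k:k + tile]   # one tile of A
                b = B[k:k + tile, j:j + tile]   # one tile of B
                loaded += a.size + b.size       # model: each tile is fetched once per use
                C[i:i + tile, j:j + tile] += a @ b
    return C, loaded

rng = np.random.default_rng(0)
n = 256
A = rng.standard_normal((n, n))
B = rng.standard_normal((n, n))

C, loaded = tiled_gemm(A, B, tile=64)
flops = 2 * n**3
flops_per_loaded_element = flops / loaded
print(flops_per_loaded_element)
```

Under this counting model, each T×T tile pair costs 2T² loads and yields 2T³ FLOPs, so the FLOPs per loaded element equal the tile size. A bigger tile means more reuse, which is the lever that AXPY and GEMV do not have.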

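Roofline rule of thumb (the exact crossover is the device's peak FLOP/s divided by its memory bandwidth, so treat these thresholds as rough):

Low FLOPs/byte (< 4): bandwidth-bound. The kernel spends its time waiting on DRAM. AXPY, GEMV, and element-wise ops live here.
High FLOPs/byte (> 8): compute-bound. The ALUs / Tensor Cores are saturated. Tiled GEMM lives here.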
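The rule of thumb above is a simplified reading of the roofline model, which caps attainable throughput at the smaller of peak compute and bandwidth times intensity. The following is a minimal sketch of that formula; `attainable_flops`, `ridge_point`, and the device numbers are placeholders chosen for round arithmetic, not the specification of any real GPU.

```python
def attainable_flops(intensity, peak_flops, peak_bandwidth):
    """Roofline bound: min(peak compute, memory bandwidth * arithmetic intensity)."""
    return min(peak_flops, intensity * peak_bandwidth)

def ridge_point(peak_flops, peak_bandwidth):
    """Intensity (FLOPs/byte) at which a kernel stops being bandwidth-bound."""
    return peak_flops / peak_bandwidth

# Hypothetical device (assumed numbers, for illustration only)
PEAK_FLOPS = 10e12       # 10 TFLOP/s
PEAK_BANDWIDTH = 1e12    # 1 TB/s

ridge = ridge_point(PEAK_FLOPS, PEAK_BANDWIDTH)
low = attainable_flops(0.5, PEAK_FLOPS, PEAK_BANDWIDTH)     # GEMV-like intensity
high = attainable_flops(683.0, PEAK_FLOPS, PEAK_BANDWIDTH)  # large-GEMM-like intensity

print(f'ridge point: {ridge:.1f} FLOPs/byte')
print(f'bound at 0.5 FLOPs/byte: {low / 1e12:.2f} TFLOP/s')
print(f'bound at 683 FLOPs/byte: {high / 1e12:.2f} TFLOP/s')
```

On this made-up device, a kernel at 0.5 FLOPs/byte can reach at most 5% of peak compute no matter how well it is written, while the GEMM-like kernel is limited only by the compute ceiling.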

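This is why single-token decode on a 7B model (GEMV-shaped: activation × weight) runs at about 50 tokens/s on an RTX 4090, while a batch of 64 tokens (GEMM-shaped) reaches about 800 tokens/s. The math per token is the same; what changed is the FLOPs/byte ratio.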
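To see how batching moves that ratio, here is a back-of-envelope sketch for a single dense layer. It assumes FP16 weights and activations (2 bytes per element), a hypothetical 4096 × 4096 layer, and that weights, inputs, and outputs each cross the memory bus exactly once; `linear_layer_intensity` is an illustrative helper, and the result is an intensity estimate, not a tokens/s prediction.

```python
def linear_layer_intensity(batch, d_in, d_out, bytes_per_elem=2):
    """FLOPs/byte for Y = X @ W with X of shape (batch, d_in) and W of shape (d_in, d_out)."""
    flops = 2 * batch * d_in * d_out                  # one multiply-add per weight per token
    elems = d_in * d_out + batch * d_in + batch * d_out   # W + X + Y, each moved once
    return flops / (bytes_per_elem * elems)

# Hypothetical layer size, chosen only for illustration
D_IN, D_OUT = 4096, 4096

intensity_by_batch = {b: linear_layer_intensity(b, D_IN, D_OUT) for b in (1, 8, 64)}
for b, value in intensity_by_batch.items():
    print(f'batch {b:>2}: {value:.2f} FLOPs/byte')
```

With one token the weight matrix is read for a single multiply-add per element, so intensity sits near 1 FLOP/byte. With 64 tokens the same weight read is amortized over 64 tokens and intensity rises to roughly 62 FLOPs/byte.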

Code

The three ops side by side in NumPy·python

import numpy as np

# AXPY: scale a vector and add (BLAS-1)
alpha = 0.5
x = np.random.randn(1_000_000)   # NumPy defaults to float64 (8 bytes per element)
y = np.random.randn(1_000_000)
y = alpha * x + y          # 2 FLOPs / 24 bytes per element ≈ 0.08 FLOPs/byte

# GEMV: matrix × vector (BLAS-2)
A = np.random.randn(4096, 4096)
x = np.random.randn(4096)
y = A @ x                  # ~2N² FLOPs / ~8N² bytes ≈ 0.25 FLOPs/byte

# GEMM: matrix × matrix (BLAS-3)
A = np.random.randn(4096, 4096)
B = np.random.randn(4096, 4096)
C = A @ B                  # ~2N³ FLOPs / ~24N² bytes ≈ N/12 FLOPs/byte

Arithmetic intensity calculator (back-of-envelope)·python

def intensity(flops, bytes_):
    return flops / bytes_

# 4096 × 4096 GEMM in FP32 (4 bytes per element)
N = 4096
flops = 2 * N**3
bytes_ = 3 * N * N * 4         # A, B, C: N² elements each
print(f'GEMM intensity: {intensity(flops, bytes_):.0f} FLOPs/byte')
# ~683 → strongly compute-bound, exactly what GPUs like

# AXPY over the same number of elements (N² per vector)
flops = 2 * N * N              # one multiply and one add per element
bytes_ = 3 * N * N * 4         # read x, read y, write y
print(f'AXPY intensity: {intensity(flops, bytes_):.2f} FLOPs/byte')
# ~0.17 → severely bandwidth-bound

Exercise

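Run the intensity calculator above for 1024×1024 and 8192×8192 GEMMs. Confirm that intensity grows linearly with N (compute scales as N³, bytes as N²). That is the algebraic reason large GEMMs are easier to make compute-bound than small ones, and why padding a small matrix up to a 'good' size sometimes wins even though it adds work on zeros.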
