Hands-On BLAS Demo

4096³ GEMM on NVIDIA and Apple: what the numbers look like

The cleanest way to feel the value of BLAS is to run the same logical operation in two ecosystems and read the throughput numbers side by side.

Level	RTX 4090 (FP32)	M3 Ultra (FP32)	M3 Ultra (FP16)
BLAS-1 (AXPY)	20 GF/s	5 GF/s	—
BLAS-2 (GEMV)	0.44 GF/s	279 GF/s (GPU)	1,136 GF/s
BLAS-3 (GEMM)	2.32 TF/s (non-Tensor-Core)	13.89 TF/s	18.15 TF/s

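Three things to read from this table:

BLAS-1 is terrible on a GPU. Hardware capable of TF/s delivers tens of GF/s at best: AXPY is bandwidth-bound with zero data reuse, so there is nothing for tuning to recover. Tensor Cores do not help either, because AXPY contains no matrix product for them to accelerate.
The CUDA FP32 GEMM figure here uses the non-Tensor-Core path to keep the comparison fair. Enabling Tensor Core FP16 through cublasGemmEx with CUBLAS_GEMM_DEFAULT_TENSOR_OP pushes the 4090 past 30 TF/s.
The M3 Ultra GEMM result is far ahead: its matrix coprocessor is fed by the very large unified memory bandwidth. Tested at 4096³.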

The arithmetic-intensity trend is platform-agnostic: BLAS-1 is bandwidth-bound, BLAS-3 is compute-bound. The absolute numbers differ, but the shape of the picture is the same.
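The trend can be made concrete by counting FLOPs per byte moved for each BLAS level. The sketch below is a back-of-the-envelope model, not a measurement: it assumes FP32 (4 bytes per element), square n×n operands, and ideal traffic where every operand is read or written exactly once. The helper names are made up for illustration.

```python
def intensity(flops, elements_moved, bytes_per_element=4):
    """FLOPs per byte, assuming each element crosses the memory bus once."""
    return flops / (elements_moved * bytes_per_element)

def axpy_intensity(n):
    # y = a*x + y: 2n FLOPs; read x, read y, write y -> 3n elements
    return intensity(2 * n, 3 * n)

def gemv_intensity(n):
    # y = A @ x: 2n^2 FLOPs; read A (n^2) and x (n), write y (n)
    return intensity(2 * n * n, n * n + 2 * n)

def gemm_intensity(n):
    # C = A @ B: 2n^3 FLOPs; read A and B, write C -> 3n^2 elements
    return intensity(2 * n**3, 3 * n * n)

n = 4096
for name, fn in [("AXPY", axpy_intensity), ("GEMV", gemv_intensity), ("GEMM", gemm_intensity)]:
    print(f"{name}: {fn(n):.3f} FLOP/byte")
```

Under these assumptions AXPY stays at 1/6 FLOP/byte and GEMV at about 1/2 FLOP/byte regardless of size, while GEMM grows as n/6. That growth is why only BLAS-3 can approach the compute peak of a device.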

Code

Benchmark harness: same shape, two backends, fair comparison (Python)

import time
import numpy as np
import torch

N = 4096
A_cpu = np.random.randn(N, N).astype(np.float32)
B_cpu = np.random.randn(N, N).astype(np.float32)

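def bench(label, A, B, runs=10, sync=lambda: None):
    # Warm-up run: keeps one-time setup cost (lazy initialization,
    # kernel selection) out of the timed loop.
    _ = A @ B
    sync(); start = time.perf_counter()
    for _ in range(runs):
        C = A @ B
        sync()  # GPU launches are asynchronous; wait before reading the clock
    elapsed = (time.perf_counter() - start) / runs
    flops = 2 * N**3  # one multiply and one add per inner-product term
    print(f'{label}: {elapsed*1000:.2f} ms, {flops/elapsed/1e12:.2f} TFLOP/s')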

# numpy (CPU via OpenBLAS / MKL)
bench('numpy CPU', A_cpu, B_cpu)

# PyTorch on whichever accelerator is available
if torch.cuda.is_available():
    A = torch.from_numpy(A_cpu).cuda()
    B = torch.from_numpy(B_cpu).cuda()
    bench('torch CUDA FP32', A, B, sync=torch.cuda.synchronize)
    bench('torch CUDA FP16', A.half(), B.half(), sync=torch.cuda.synchronize)
elif torch.backends.mps.is_available():
    A = torch.from_numpy(A_cpu).to('mps')
    B = torch.from_numpy(B_cpu).to('mps')
    bench('torch MPS FP32', A, B, sync=torch.mps.synchronize)
    bench('torch MPS FP16', A.half(), B.half(), sync=torch.mps.synchronize)
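To sanity-check harness output against the table, it helps to know what per-call time a given throughput implies at this size. The sketch below only rearranges the harness formula (FLOPs = 2·N³); the three throughput values are copied from the BLAS-3 row of the table, and `ms_per_gemm` is a hypothetical helper name.

```python
N = 4096
FLOPS_PER_GEMM = 2 * N**3  # same FLOP count the harness uses

def ms_per_gemm(tflops):
    """Milliseconds one N x N x N GEMM takes at the given TFLOP/s."""
    return FLOPS_PER_GEMM / (tflops * 1e12) * 1e3

# Throughput figures taken from the BLAS-3 row of the table above.
table_row = [
    ("RTX 4090 FP32 (non-Tensor-Core)", 2.32),
    ("M3 Ultra FP32", 13.89),
    ("M3 Ultra FP16", 18.15),
]
for label, tflops in table_row:
    print(f"{label}: {ms_per_gemm(tflops):.2f} ms per GEMM")
```

These work out to roughly 59 ms, 9.9 ms, and 7.6 ms per call. If your measured times are far outside that range on comparable hardware, check that the synchronize call is actually being passed to `bench`.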

Exercise

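Run the benchmark above on whatever box you have access to and capture the results in a small CSV (label, time_ms, tflops). Those numbers are your machine's BLAS reference: every hand-written kernel in the rest of this quest is compared against them. If you have access to both an NVIDIA and an Apple box, run it on both; the asymmetry between the two ecosystems is part of the lesson.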
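One possible way to capture the results is sketched below. It assumes you adapt the harness so that it hands back the per-run elapsed time in seconds instead of only printing it; `record` and `write_results` are hypothetical helpers, and the timing in the demo is a made-up placeholder, not a measurement.

```python
import csv
import io

def record(label, elapsed_s, n):
    """Build one (label, time_ms, tflops) row for an n x n x n GEMM."""
    flops = 2 * n**3
    return (label, round(elapsed_s * 1e3, 3), round(flops / elapsed_s / 1e12, 3))

def write_results(fileobj, rows):
    """Write a header plus one line per benchmark row to an open text file."""
    writer = csv.writer(fileobj, lineterminator="\n")
    writer.writerow(["label", "time_ms", "tflops"])
    writer.writerows(rows)

# Demo with a placeholder timing (0.010 s is illustrative only).
buf = io.StringIO()
write_results(buf, [record("example", 0.010, 4096)])
print(buf.getvalue())
```

For a real run, replace the `StringIO` buffer with a file opened as `open("blas_reference.csv", "w", newline="")` (the file name is an arbitrary choice) and pass in one row per backend you measured.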
