What is BLAS?

The 1979 spec that runs the modern AI stack

Would you rebuild NumPy from scratch just to add two vectors? Of course not. So why would anyone hand-code GPU kernels when BLAS already exists?

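BLAS (Basic Linear Algebra Subprograms) is the bedrock API of numerical computing. It was conceived in 1979 for Fortran programs on the supercomputers of the day, starting with vector operations (Level 1); matrix-vector operations (Level 2) and matrix-matrix operations (Level 3) were added in 1988 and 1990. The result is a spec of roughly 60 routines. Today almost every scientific library is layered on top of it, directly or indirectly.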
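To make the three levels concrete, here is a minimal pure-Python sketch of one representative routine from each. The function names mirror the BLAS routine names, but the signatures are simplified for illustration (real BLAS routines also take strides, transpose flags, and a beta scaling term), and the loops are reference semantics only, nothing like a tuned implementation.

```python
# Reference semantics for one routine per BLAS level.
# Simplified, illustrative signatures: real BLAS also takes strides,
# transpose flags, and a beta*y / beta*C accumulation term.

def axpy(alpha, x, y):
    """Level 1 (vector-vector): returns alpha*x + y."""
    return [alpha * xi + yi for xi, yi in zip(x, y)]

def gemv(alpha, A, x):
    """Level 2 (matrix-vector): returns alpha * (A @ x)."""
    return [alpha * sum(a * xj for a, xj in zip(row, x)) for row in A]

def gemm(alpha, A, B):
    """Level 3 (matrix-matrix): returns alpha * (A @ B)."""
    cols = list(zip(*B))  # columns of B
    return [[alpha * sum(a * b for a, b in zip(row, col)) for col in cols]
            for row in A]

A = [[1.0, 2.0], [3.0, 4.0]]
B = [[5.0, 6.0], [7.0, 8.0]]
x = [1.0, -1.0]
y = [10.0, 20.0]

print(axpy(2.0, x, y))   # [12.0, 18.0]
print(gemv(1.0, A, x))   # [-1.0, -1.0]
print(gemm(1.0, A, B))   # [[19.0, 22.0], [43.0, 50.0]]
```

In NumPy terms these correspond to alpha * x + y, A @ x, and A @ B. Which of them actually reach a BLAS routine depends on the build and the dtype.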

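Why it matters:

Centralized optimization: vendors pour years of tuning into a small set of kernels, and everyone above them inherits the speedup for free.
Portability: a Fortran program from 1982 compiles and runs as-is on a 2025 M3 Ultra. The interface has not changed.
Composability: NumPy, PyTorch, JAX, MLX, and Core ML all end up calling BLAS or BLAS-style vendor kernels underneath. The language on top differs; the kernels at the bottom are largely the same.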

Level	Operand shapes	GPU suitability
1	Vector with vector (AXPY, dot)	Poor (bandwidth-bound)
2	Matrix with vector (GEMV)	So-so (still bandwidth-bound)
3	Matrix with matrix (GEMM)	Excellent (compute-bound, the reason GPUs exist)
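The bandwidth-bound versus compute-bound split in the table follows from arithmetic intensity: floating-point operations per byte of memory traffic. The sketch below works that out for square problems of size n, assuming float32 data and an idealized model in which each operand is read or written exactly once. The function names are made up for this illustration.

```python
# Idealized arithmetic intensity (FLOPs per byte moved) for each BLAS level.
# Assumptions: float32 (4 bytes), every operand touched exactly once.

def axpy_intensity(n, bytes_per_elem=4):
    flops = 2 * n                                 # one multiply + one add per element
    traffic = 3 * n * bytes_per_elem              # read x, read y, write y
    return flops / traffic

def gemv_intensity(n, bytes_per_elem=4):
    flops = 2 * n * n                             # multiply-add per matrix entry
    traffic = (n * n + 2 * n) * bytes_per_elem    # read A and x, write y
    return flops / traffic

def gemm_intensity(n, bytes_per_elem=4):
    flops = 2 * n ** 3                            # n multiply-adds per output entry
    traffic = 3 * n * n * bytes_per_elem          # read A and B, write C
    return flops / traffic

for n in (256, 4096):
    print(n,
          round(axpy_intensity(n), 3),
          round(gemv_intensity(n), 3),
          round(gemm_intensity(n), 1))
```

Under this model AXPY stays at 1/6 FLOP per byte and GEMV just under 1/2, whatever the size, while GEMM grows as n/6. Only Level 3 gives the hardware enough work per byte to keep its arithmetic units busy.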

Who ships what?

NVIDIA: cuBLAS (plus cuBLASLt, a GEMM-focused API with finer control over Tensor Core use).
AMD: rocBLAS / hipBLAS.
Intel / CPU: oneAPI MKL, OpenBLAS.
Apple: Accelerate (CPU) + MPSMatrixMultiplication from Metal Performance Shaders (GPU).

Code

Same call shape, four backends (Python)

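# Every call below ends up in a vendor-tuned GEMM under the hood.
# Each section needs its own library and hardware; run the ones that apply.

# NumPy (CPU): uses OpenBLAS or MKL, depending on the wheel
import numpy as np
A = np.random.rand(512, 512).astype(np.float32)
B = np.random.rand(512, 512).astype(np.float32)
C = np.matmul(A, B)

# PyTorch CUDA: calls cuBLAS (requires an NVIDIA GPU)
import torch
A_cuda, B_cuda = torch.from_numpy(A).cuda(), torch.from_numpy(B).cuda()
C = torch.matmul(A_cuda, B_cuda)

# PyTorch MPS: dispatches to Metal Performance Shaders (requires Apple silicon)
A_mps, B_mps = torch.from_numpy(A).to("mps"), torch.from_numpy(B).to("mps")
C = torch.matmul(A_mps, B_mps)

# MLX: as far as we know, runs its own Metal GEMM kernels on the GPU
# rather than calling MPSMatrixMultiplication
import mlx.core as mx
A_m, B_m = mx.array(A), mx.array(B)
C = mx.matmul(A_m, B_m); mx.eval(C)  # MLX is lazy; eval forces the computation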

Exercise

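In an environment with NumPy installed, run np.show_config(). Look for 'openblas_info' or 'mkl_info' (newer NumPy releases print a "Build Dependencies" section with a blas entry instead): that is the BLAS library NumPy actually linked against on your machine. If you have several Python environments, compare their build info. The same NumPy code can run 2-3× faster on an MKL-linked build; the difference comes from the choice of BLAS, not from the NumPy code.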
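One way to put a number on the comparison is a small timing helper like the sketch below. The name gemm_gflops and its defaults are made up for this exercise. It reports the best of a few runs, and the result also depends on thread count, matrix size, and CPU, so treat it as a rough indication rather than a benchmark.

```python
import time
import numpy as np

def gemm_gflops(n=512, repeats=5, dtype=np.float32):
    """Time an n x n matmul and return the best observed GFLOP/s."""
    rng = np.random.default_rng(0)
    A = rng.random((n, n), dtype=dtype)
    B = rng.random((n, n), dtype=dtype)
    np.matmul(A, B)                      # warm-up run, not timed
    best = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter()
        np.matmul(A, B)
        best = min(best, time.perf_counter() - t0)
    return 2 * n ** 3 / best / 1e9       # a GEMM costs about 2*n^3 FLOPs

if __name__ == "__main__":
    np.show_config()                     # shows which BLAS this build links
    print(f"{gemm_gflops():.1f} GFLOP/s")
```

Run the same script in each environment and compare the printed figure alongside the BLAS reported by np.show_config().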
