cuBLAS & MPS cheat sheet

Two libraries, one API shape

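cuBLAS and MPSMatrixMultiplication are different products from different companies, but they implement essentially the same operation: general matrix multiply (GEMM). The differences worth memorizing: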

Aspect	cuBLAS (NVIDIA)	MPS (Apple)
Default layout	Column-major (Fortran)	Row-major
Default precision focus	FP32; FP16/BF16 via cublasGemmEx	Half-precision-first; FP32 also supported
Lifecycle	One handle per thread, async by default	One command buffer per dispatch, sync via `waitUntilCompleted`
α/β scalars	Yes (full GEMM signature)	Yes
Algorithm picker	Heuristics via cublasLt; manual choice by algo ID	Library picks; less direct control

Rules of the cuBLAS road

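Create one cublasHandle_t per thread (or per CUDA stream) and reuse it across calls.
Calls are asynchronous by default; call cudaStreamSynchronize if you need the result before the next CPU code runs.
cuBLAS does column-major math, so most C/C++ users avoid transposing buffers. A row-major buffer read as column-major is already its transpose, so they pass the operands in swapped order (B first, then A) with CUBLAS_OP_N on both, and cuBLAS computes Cᵀ = Bᵀ · Aᵀ, which in memory is exactly row-major C.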
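
The layout identity behind the last rule can be checked on the CPU without a GPU. The sketch below is plain C++, not cuBLAS: `gemm_colmajor` and `matmul_rowmajor` are hypothetical helper names, and the reference loop only mimics the column-major, leading-dimension convention of a non-transposed GEMM.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Reference column-major GEMM without transposes (hypothetical helper):
// C (m x n, ldc) = A (m x k, lda) * B (k x n, ldb), all column-major.
void gemm_colmajor(int m, int n, int k,
                   const float* A, int lda,
                   const float* B, int ldb,
                   float* C, int ldc) {
    assert(lda >= m && ldb >= k && ldc >= m);
    for (int j = 0; j < n; ++j) {
        for (int i = 0; i < m; ++i) {
            float acc = 0.f;
            for (int p = 0; p < k; ++p) {
                acc += A[i + static_cast<std::size_t>(p) * lda] *
                       B[p + static_cast<std::size_t>(j) * ldb];
            }
            C[i + static_cast<std::size_t>(j) * ldc] = acc;
        }
    }
}

// Row-major C (m x n) = A (m x k) * B (k x n), using only the column-major routine.
// A row-major buffer read as column-major is its transpose, so we request
// C^T (n x m) = B^T (n x k) * A^T (k x m): swap the operands and swap m with n.
std::vector<float> matmul_rowmajor(int m, int n, int k,
                                   const std::vector<float>& A,
                                   const std::vector<float>& B) {
    std::vector<float> C(static_cast<std::size_t>(m) * n, 0.f);
    gemm_colmajor(n, m, k, B.data(), n, A.data(), k, C.data(), n);
    return C;
}
```

For the 2×3 matrix A = [1 2 3; 4 5 6] and the 3×2 matrix B = [7 8; 9 10; 11 12], `matmul_rowmajor(2, 2, 3, A, B)` returns 58, 64, 139, 154 in row-major order, with no buffer ever physically transposed.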

Rules of the MPS road

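Row-major by default, the same order in which numpy and Swift arrays are naturally stored.
The command buffer lifecycle is explicit: encode, commit, wait. The wait step is not optional: reading the result buffer before the command buffer completes is a race.
Half-precision MPSMatrix is first-class; for inference, prefer FP16 unless you need extra numerical stability.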

Code

cuBLAS sgemm: the column-major dance · cuda

#include <cublas_v2.h>
#include <cuda_runtime.h>   // cudaDeviceSynchronize

cublasHandle_t handle;
cublasCreate(&handle);

const float alpha = 1.f, beta = 0.f;
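// Logically: C = A·B with A (m,k), B (k,n), C (m,n), all row-major.
// dA, dB, dC are device pointers that already hold those row-major buffers.
// A row-major buffer read as column-major is its transpose, so ask cuBLAS for
// Cᵀ = Bᵀ · Aᵀ: swap the operand order and use OP_N (no extra transpose).
cublasSgemm(handle,
            CUBLAS_OP_N, CUBLAS_OP_N,    // buffers already look like Bᵀ and Aᵀ
            n, m, k,                     // result is column-major (n × m) = row-major (m × n)
            &alpha,
            dB, n,                       // leading dim of Bᵀ (n × k) = cols of row-major B (= n)
            dA, k,                       // leading dim of Aᵀ (k × m) = cols of row-major A (= k)
            &beta,
            dC, n);                      // leading dim of Cᵀ (n × m) = cols of row-major C (= n)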

cudaDeviceSynchronize();
cublasDestroy(handle);

MPSMatrixMultiplication: Swift, row-major, half precision · swift

import Metal
import MetalPerformanceShaders

let device = MTLCreateSystemDefaultDevice()!
let queue = device.makeCommandQueue()!

// Wrap existing MTLBuffers (bufA, bufB, bufC) with MPSMatrix descriptors.
let descA = MPSMatrixDescriptor(rows: m, columns: k,
    rowBytes: k * MemoryLayout<Float16>.stride, dataType: .float16)
let descB = MPSMatrixDescriptor(rows: k, columns: n,
    rowBytes: n * MemoryLayout<Float16>.stride, dataType: .float16)
let descC = MPSMatrixDescriptor(rows: m, columns: n,
    rowBytes: n * MemoryLayout<Float16>.stride, dataType: .float16)

let A = MPSMatrix(buffer: bufA, descriptor: descA)
let B = MPSMatrix(buffer: bufB, descriptor: descB)
let C = MPSMatrix(buffer: bufC, descriptor: descC)

let mm = MPSMatrixMultiplication(device: device,
    transposeLeft: false, transposeRight: false,
    resultRows: m, resultColumns: n,
    interiorColumns: k,
    alpha: 1.0, beta: 0.0)

let cb = queue.makeCommandBuffer()!
mm.encode(commandBuffer: cb, leftMatrix: A, rightMatrix: B, resultMatrix: C)
cb.commit()
cb.waitUntilCompleted()
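
The `rowBytes` arguments above are the byte distance between the starts of consecutive rows. A minimal C++ sketch of that row-major addressing follows, assuming a hypothetical helper `rowmajor_offset` (not an MPS API) and allowing rows to be padded beyond columns × element size.

```cpp
#include <cassert>
#include <cstddef>

// Byte offset of element (row, col) in a row-major matrix.
// row_bytes is the distance between the starts of consecutive rows; it may be
// larger than cols * elem_size when rows are padded for alignment.
std::size_t rowmajor_offset(std::size_t row, std::size_t col,
                            std::size_t row_bytes, std::size_t elem_size) {
    assert((col + 1) * elem_size <= row_bytes);  // element must fit inside its row
    return row * row_bytes + col * elem_size;
}
```

With 8 half-precision columns (2 bytes each) and tightly packed rows, `rowmajor_offset(2, 3, 16, 2)` is 38; padding each row to 32 bytes moves the same element to offset 70.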

Exercise

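Pick whichever side you can run and write a 'hello cuBLAS' or 'hello MPS' GEMM that multiplies two random 1024×1024 matrices. Compare the output element-wise against numpy.dot within a tolerance of 1e-5 for FP32 (FP16 needs a looser tolerance). Time the call, excluding data transfers, and synchronize before stopping the timer. You should see roughly 3–7 TF/s FP32 on a 4090, and roughly 13 TF/s FP32 / 18 TF/s FP16 on an M3 Ultra. Note the numbers; in Track 8 you will compare them against a hand-coded tiled GEMM.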
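
The checking and timing arithmetic for this exercise can be sketched in plain C++. The helper names `all_close` and `gemm_tflops` are hypothetical, the comparison follows the usual "absolute plus relative tolerance" form, and the throughput figure assumes the standard count of 2·m·n·k floating-point operations for a GEMM.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Element-wise check: |got - ref| <= atol + rtol * |ref| for every element.
bool all_close(const std::vector<float>& got, const std::vector<float>& ref,
               double rtol, double atol) {
    if (got.size() != ref.size()) return false;
    for (std::size_t i = 0; i < got.size(); ++i) {
        double diff = std::fabs(static_cast<double>(got[i]) -
                                static_cast<double>(ref[i]));
        if (diff > atol + rtol * std::fabs(static_cast<double>(ref[i]))) {
            return false;
        }
    }
    return true;
}

// Throughput in TFLOP/s: a GEMM does one multiply and one add per
// inner-product term, i.e. 2*m*n*k floating-point operations.
double gemm_tflops(int m, int n, int k, double seconds) {
    assert(seconds > 0.0);
    return 2.0 * m * n * k / seconds / 1e12;
}
```

As a sanity check on the units, a 1024×1024×1024 GEMM that takes 1 ms works out to about 2.15 TFLOP/s.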