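CUDA ↔ Metal Cheat Sheet

Same army, different uniforms

Internalize this table and you can read either codebase in about five minutes. The hardware concepts are the same; only the spelling differs.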

CUDA	Metal	Notes
`__global__ void my_kernel(...)`	`kernel void my_kernel(...)`	GPU entry point launched from the CPU. (`kernel` is a reserved word in Metal, so it cannot also be the function name.)
`__device__ float helper()`	(no qualifier; the default for non-kernel functions)	GPU-only helper.
`my_kernel<<<grid, block>>>()`	`encoder.dispatchThreadgroups(grid, threadsPerThreadgroup: block)`	Same two-level launch geometry.
`blockIdx.x` / `threadIdx.x`	`[[threadgroup_position_in_grid]]` / `[[thread_position_in_threadgroup]]`	SIMT indices. (`[[thread_index_in_threadgroup]]` is the flattened scalar index.)
warp (32 threads)	SIMD-group (32 threads)	Divergence and shuffle math are identical.
`__shared__` memory	`threadgroup` memory	On-chip SRAM scratchpad shared by the block / threadgroup.
`__constant__` memory	`constant` address space	Small read-only lookup tables.
`__syncthreads()`	`threadgroup_barrier(mem_flags::mem_threadgroup)`	Local sync plus an optional memory fence.
`cudaMalloc` / `cudaMemcpy`	`device.makeBuffer(length:options:)` with `.storageModeShared`	Apple Silicon: no copy needed.
cuBLAS	MPSMatrixMultiplication	Vendor BLAS; both reach roughly 80% of theoretical peak.
nvcc	xcrun metal + metallib	Metal is a two-step build (compile, then link to .metallib).
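
To make the warp, `__shared__`, and `__syncthreads()` rows concrete, here is a minimal CUDA sketch of a block-level sum. The names `warp_sum`, `block_sum`, and `gpu_sum` are hypothetical, error checking is omitted, and the kernel assumes a block size that is a multiple of 32 and a non-empty input. A Metal port would presumably use `simd_shuffle_down`, `threadgroup` memory, and `threadgroup_barrier` in the same places.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Sum across one warp with register shuffles; lane 0 ends up with the total.
__device__ float warp_sum(float v) {
    for (int offset = 16; offset > 0; offset >>= 1)
        v += __shfl_down_sync(0xffffffffu, v, offset);
    return v;
}

// Writes one partial sum per block.
// Assumes blockDim.x is a multiple of 32 and at most 1024.
__global__ void block_sum(const float* in, float* out, int n) {
    __shared__ float warp_totals[32];          // on-chip scratchpad, one slot per warp
    int i    = blockIdx.x * blockDim.x + threadIdx.x;
    int lane = threadIdx.x % 32;
    int warp = threadIdx.x / 32;

    float v = warp_sum(i < n ? in[i] : 0.0f);  // stage 1: reduce inside each warp
    if (lane == 0) warp_totals[warp] = v;
    __syncthreads();                           // every warp's total is now visible

    if (warp == 0) {                           // stage 2: first warp reduces the totals
        int num_warps = blockDim.x / 32;
        v = warp_sum(lane < num_warps ? warp_totals[lane] : 0.0f);
        if (lane == 0) out[blockIdx.x] = v;
    }
}

// Hypothetical host helper: sums a non-empty vector on the GPU,
// then adds the per-block partial sums on the CPU.
float gpu_sum(const std::vector<float>& host) {
    int n = static_cast<int>(host.size());
    int block = 256, grid = (n + block - 1) / block;
    float *d_in, *d_out;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, grid * sizeof(float));
    cudaMemcpy(d_in, host.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    block_sum<<<grid, block>>>(d_in, d_out, n);
    std::vector<float> partial(grid);
    cudaMemcpy(partial.data(), d_out, grid * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    float total = 0.0f;
    for (float p : partial) total += p;
    return total;
}

int main() {
    std::vector<float> data(1000, 1.0f);
    printf("sum = %.1f\n", gpu_sum(data));
    return 0;
}
```

The two-stage shape (shuffle inside a warp, scratchpad plus barrier across warps) is the part that carries over between the two APIs; only the spelling of each primitive changes.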

A few things that really are different:

Build pipeline. CUDA: a single nvcc command produces an executable. Metal: metal compiles to .air, metallib links that into a .metallib, swiftc builds the host driver, and the .metallib is loaded at runtime.
printf vs. buffer. CUDA has a GPU-side printf. Metal traditionally makes you round-trip through a buffer, which is closer to what production code looks like anyway.
Tensor unit. Both have one: NVIDIA Tensor Cores (mma.sync) and Apple's matrix coprocessor. In both cases the vendor BLAS wraps it.
Memory model. Discrete VRAM vs. unified RAM (lesson 2 of this track).
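
The printf-versus-buffer point is easiest to see with both styles written in CUDA. This is a minimal sketch: `debug_printf`, `debug_buffer`, and `read_back_log` are hypothetical names, the logged value is made up for illustration, and error checking is omitted. The second style is the one a Metal port would also use.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Style 1: print straight from the GPU (CUDA only).
__global__ void debug_printf() {
    printf("block %d thread %d\n", blockIdx.x, threadIdx.x);
}

// Style 2: record values in a buffer and inspect them on the host.
__global__ void debug_buffer(int* log) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    log[i] = i * i;   // hypothetical value worth inspecting
}

// Hypothetical helper: runs debug_buffer and copies the log back.
// h_log must have room for grid * block ints.
void read_back_log(int grid, int block, int* h_log) {
    int n = grid * block;
    int* d_log;
    cudaMalloc(&d_log, n * sizeof(int));
    debug_buffer<<<grid, block>>>(d_log);
    // The copy waits for the kernel on the default stream to finish.
    cudaMemcpy(h_log, d_log, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_log);
}

int main() {
    const int grid = 2, block = 4, n = grid * block;

    debug_printf<<<grid, block>>>();
    cudaDeviceSynchronize();          // flushes device-side printf output

    int h_log[n];
    read_back_log(grid, block, h_log);
    for (int i = 0; i < n; ++i) printf("log[%d] = %d\n", i, h_log[i]);
    return 0;
}
```

The ordering of lines printed by the first kernel is not guaranteed across blocks, whereas the buffer version gives a deterministic, indexable record.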

Code

The same vector add in two languages; read them side by side·cuda

// CUDA
__global__ void vec_add(const float* A, const float* B,
                        float* C, int N) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N) C[i] = A[i] + B[i];
}

The same vector add, Metal version·metal

// Metal
kernel void vec_add(
    const device float *A [[buffer(0)]],
    const device float *B [[buffer(1)]],
    device       float *C [[buffer(2)]],
    constant     uint  &N [[buffer(3)]],
    uint i                [[thread_position_in_grid]])
{
    if (i < N) C[i] = A[i] + B[i];
}
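
Neither kernel shows the host side, where the launch geometry and memory rows of the table come into play. As a rough sketch for the CUDA version only, a driver might look like the following. `vec_add_host` is a hypothetical wrapper, the block size of 256 is an arbitrary choice, inputs are assumed non-empty and equal in length, and error checking is omitted. On the Metal side the same steps would map to `makeBuffer` with shared storage and `dispatchThreadgroups`, without the explicit copies on Apple Silicon.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

__global__ void vec_add(const float* A, const float* B, float* C, int N) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N) C[i] = A[i] + B[i];   // guard: the last block may overshoot N
}

// Hypothetical host wrapper: allocate, copy in, launch, copy out, free.
std::vector<float> vec_add_host(const std::vector<float>& a,
                                const std::vector<float>& b) {
    int n = static_cast<int>(a.size());
    size_t bytes = n * sizeof(float);

    float *dA, *dB, *dC;
    cudaMalloc(&dA, bytes);
    cudaMalloc(&dB, bytes);
    cudaMalloc(&dC, bytes);
    cudaMemcpy(dA, a.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, b.data(), bytes, cudaMemcpyHostToDevice);

    int block = 256;                        // threads per block
    int grid  = (n + block - 1) / block;    // round up so every element is covered
    vec_add<<<grid, block>>>(dA, dB, dC, n);

    std::vector<float> c(n);
    cudaMemcpy(c.data(), dC, bytes, cudaMemcpyDeviceToHost);  // waits for the kernel

    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
    return c;
}

int main() {
    std::vector<float> a(1000, 1.0f), b(1000, 2.0f);
    std::vector<float> c = vec_add_host(a, b);
    printf("c[0] = %.1f, c[999] = %.1f\n", c[0], c[999]);
    return 0;
}
```

Rounding the grid size up is what makes the `if (i < N)` guard in the kernel necessary: with 1000 elements and 256 threads per block, the fourth block has 24 threads with nothing to do.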
