Hello CUDA Example

The traditional first kernel: every GPU thread prints its own coordinates.

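This is the smallest CUDA program that proves the toolchain works end to end. The driver loads the kernel, the runtime launches the threads, the threads execute the device code, and cudaDeviceSynchronize waits for the kernel to finish, at which point the buffered device printf output is flushed to the CPU's stdout.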

Read the source carefully. Each CUDA-specific line maps to one of the three CUDA extensions from lesson 1:

__global__ void say_hello(): the function qualifier that marks this function as a kernel.
say_hello<<<1, 4>>>(): the triple-angle-bracket launch, here 1 block of 4 threads.
blockIdx.x, threadIdx.x: the SIMT built-ins that give each thread its identity.

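The launch syntax <<<grid, block>>> is one of the CUDA constructs that looks like template syntax but isn't. NVCC parses it as a special launch expression. After the NVCC pre-pass, what the ordinary C++ compiler sees is a runtime API call with the grid and block dimensions packed in as arguments.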
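
To make that concrete, here is a rough sketch of the same launch written by hand with the public runtime call cudaLaunchKernel. This is an illustration, not the code NVCC actually generates (its generated stubs are internal and may look different). The helper name launch_hello is made up for this example.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void say_hello() {
    printf("Hello from block %d, thread %d\n", blockIdx.x, threadIdx.x);
}

// Hypothetical helper: launches say_hello without the <<<...>>> syntax.
cudaError_t launch_hello(unsigned blocks, unsigned threads) {
    dim3 grid(blocks);    // first launch parameter in <<<grid, block>>>
    dim3 block(threads);  // second launch parameter

    // say_hello takes no parameters, so no argument array is passed.
    // The last two arguments are dynamic shared memory (0 bytes) and
    // the stream (0 = default stream).
    cudaError_t err = cudaLaunchKernel((const void*)say_hello, grid, block,
                                       nullptr, 0, 0);
    if (err != cudaSuccess) return err;

    // Wait for the kernel; this also flushes the device printf output.
    return cudaDeviceSynchronize();
}

int main() {
    cudaError_t err = launch_hello(1, 4);  // same shape as say_hello<<<1, 4>>>()
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }
    return 0;
}
```

The grid and block dimensions are ordinary dim3 arguments here, which is the point: the triple angle brackets are sugar for a function call, not a template instantiation.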

Code

hello.cu: the smallest CUDA program·cuda

#include <cstdio>
#include <cuda_runtime.h>

__global__ void say_hello() {
    printf("Hello from block %d, thread %d\n",
           blockIdx.x, threadIdx.x);
}

int main() {
    say_hello<<<1, 4>>>();           // 1 block, 4 threads
    // Wait for the kernel to finish; this also flushes device printf output.
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }
    return 0;
}

Build + run·bash

# Linux / WSL: replace sm_89 with your GPU's compute capability
nvcc -arch=sm_89 hello.cu -o hello
./hello
# Hello from block 0, thread 0
# Hello from block 0, thread 1
# Hello from block 0, thread 2
# Hello from block 0, thread 3

# Windows (Developer Command Prompt for VS 2022)
nvcc -arch=sm_89 hello.cu -o hello.exe
hello.exe
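
If you don't know your compute capability, one way to find it is to ask the runtime. The sketch below assumes the GPU you care about is device 0, and the file name query_arch.cu and the helper sm_number are made up for this example. It has no kernels, so it should build with plain nvcc query_arch.cu -o query_arch, no -arch flag needed.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: turns (major, minor) into the number used in sm_XY.
int sm_number(int major, int minor) {
    return major * 10 + minor;
}

int main() {
    cudaDeviceProp prop;
    // Device 0 is an assumption; pick another index on multi-GPU machines.
    cudaError_t err = cudaGetDeviceProperties(&prop, 0);
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }
    // prop.major / prop.minor hold the compute capability, e.g. 8 and 9.
    printf("%s: use -arch=sm_%d\n", prop.name,
           sm_number(prop.major, prop.minor));
    return 0;
}
```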

Exercise

Save hello.cu, compile it, and run it. Then change the launch from <<<1, 4>>> to <<<2, 8>>> (2 blocks × 8 threads = 16 prints). Recompile and run again. The lines may not come out in the order you expect (block 0 thread 0, block 0 thread 1, ..., block 1 thread 7). Read the blockIdx.x and threadIdx.x values to check whether the GPU scheduler interleaved the blocks. This is a first taste of why GPU debugging needs fences and prints tagged with thread IDs.
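
Once you've tried the exercise, here is one possible way to tag each print with a grid-wide ID. Treat it as a sketch: the names say_hello_tagged and global_id are made up for this example, and it uses blockDim.x (the number of threads per block), a built-in that hello.cu itself doesn't need.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: flattens (block, thread) into one grid-wide index.
__host__ __device__ int global_id(int block, int block_dim, int thread) {
    return block * block_dim + thread;
}

__global__ void say_hello_tagged() {
    // Unique across the whole grid: 0..15 for a <<<2, 8>>> launch.
    int gid = global_id(blockIdx.x, blockDim.x, threadIdx.x);
    // Zero-padded so the lines can be sorted as plain text afterwards.
    printf("[gid %02d] block %d, thread %d\n",
           gid, blockIdx.x, threadIdx.x);
}

int main() {
    say_hello_tagged<<<2, 8>>>();    // 2 blocks, 8 threads each
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }
    return 0;
}
```

Because every line carries its own gid, the print order stops mattering: you can pipe the output through sort to get the logical order back, and compare that with the raw output to see how the blocks were actually interleaved.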
