Reduction — 행렬 수학의 심장박동

Reduction은 GPU 압축

Reduction은 병렬 값들을 하나 (또는 작은 한 줌)로 collapse. Sum, max, mean, norm, dot product, softmax 분모 — 다 reduction. 자기 트랙 레슨 받을 자격 있는 이유: 모든 신경망에 가득하고, reduction 잘못하면 hot-path 시간 50%+ 낭비.

이기는 레시피는 계층적:

Register — 각 thread가 자기 partial sum을 register에 누적 (메모리 traffic 0).
Warp-level shuffle — __shfl_down_sync로 warp 안 partial 32개 결합. shared-memory traffic 0.
Shared-memory tree — 각 warp가 자기 sum을 shared-memory slot에 write, 그 다음 slot 가로질러 tree reduction으로 block sum.
Block 결과 — global 누적기에 atomic add 또는 block sum reduce하는 두 번째 커널 launch.

LLM이 reduction 예산 쓰는 곳:

Attention softmax — row max + exp(x - max) sum per row. attention head당 reduction 둘.
LayerNorm / RMSNorm — token당 mean과 (root-)mean-square, 모든 layer.
Loss 계산 — logit 수십억 → scalar 하나.
Gradient accumulation — micro-batch gradient를 optimizer step 전에 합산.

Code

Shuffle로 warp-level reduction (CUDA)·cuda

// Shared memory 없이 warp의 32 thread 가로질러 sum.
__device__ float warp_reduce_sum(float v) {
    for (int off = warpSize / 2; off > 0; off >>= 1) {
        v += __shfl_down_sync(0xffffffff, v, off);
    }
    return v;   // warp의 thread 0이 이제 warp sum 보유
}

__global__ void block_sum(const float* in, float* out, int N) {
    extern __shared__ float warp_sums[];   // block의 warp당 slot 하나
    int tid = threadIdx.x;
    int gid = blockIdx.x * blockDim.x + tid;
    int lane = tid % warpSize;
    int warp = tid / warpSize;

    float v = (gid < N) ? in[gid] : 0.f;
    v = warp_reduce_sum(v);                // register에서 warp sum

    if (lane == 0) warp_sums[warp] = v;    // 첫 lane이 shared write
    __syncthreads();

    // 첫 warp가 warp sum들을 reduce
    if (warp == 0) {
        v = (lane < blockDim.x / warpSize) ? warp_sums[lane] : 0.f;
        v = warp_reduce_sum(v);
        if (lane == 0) atomicAdd(out, v);  // global에 기여
    }
}

Metal에서 SIMD-group reduction (Apple equivalent)·metal

#include <metal_stdlib>
using namespace metal;

// simd_sum이 warp shuffle reduction의 built-in equivalent.
// SIMD-group의 32 thread가 값 공유, sum return.
kernel void block_sum(const device float *in       [[buffer(0)]],
                      device       float *out      [[buffer(1)]],
                      constant     uint  &N        [[buffer(2)]],
                      threadgroup  float *warp_sums [[threadgroup(0)]],
                      uint gid                     [[thread_position_in_grid]],
                      uint tid                     [[thread_index_in_threadgroup]],
                      uint simd_lane               [[thread_index_in_simdgroup]],
                      uint simd_id                 [[simdgroup_index_in_threadgroup]])
{
    float v = (gid < N) ? in[gid] : 0.f;
    v = simd_sum(v);                       // SIMD-group sum
    if (simd_lane == 0) warp_sums[simd_id] = v;
    threadgroup_barrier(mem_flags::mem_threadgroup);

    if (simd_id == 0) {
        v = (simd_lane < /* warp 수 */ 8) ? warp_sums[simd_lane] : 0.f;
        v = simd_sum(v);
        if (simd_lane == 0) atomic_fetch_add_explicit(
            (device atomic_float *)out, v, memory_order_relaxed);
    }
}

Exercise

쓸 수 있는 GPU에서 위 block_sum 구현하고 'thread 하나가 N 원소 다 read' naive 구현이랑 벤치마크. 큰 입력에선 계층적 버전이 50–100× 빨라야 함 — 진짜로 parallelism 쓰니까. 그 다음 NumPy / PyTorch / MLX의 동등 거랑 timing — framework의 2× 안에 보통 들어가, 즉 계층적 reduction이 경쟁력 있음.

Reduction — 행렬 수학의 심장박동

Reduction은 GPU 압축

Code

External links

Exercise

Progress

댓글 0