Hello Metal example

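The 'each thread writes its own coordinates to a buffer' pattern

Metal has no GPU printf. The convention instead: each thread writes its identity into a device buffer, and once the command buffer finishes, the CPU reads the buffer and prints it. It's verbose, but it forces the same round trip every real workload has to make anyway.

Two files: hello.metal (the kernel) and hello_host.swift (launch driver + result printer). Build with metal + metallib + swiftc, then run the one host binary, which loads hello.metallib from the current directory at runtime.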

Code

hello.metal — kernel·metal

#include <metal_stdlib>
using namespace metal;

kernel void hello(
    device uint2 *out               [[buffer(0)]],
    uint  tid_in_tg                 [[thread_index_in_threadgroup]],
    uint  tg_id_in_g                [[threadgroup_position_in_grid]])
{
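    // Must equal the threadsPerThreadgroup width the host passes at dispatch.
    const uint threadsPerTG = 4;
    // Flatten (threadgroup, thread) into one index into the output buffer.
    uint gid = tg_id_in_g * threadsPerTG + tid_in_tg;
    out[gid] = uint2(tg_id_in_g, tid_in_tg);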
}

hello_host.swift — Swift driver·swift

import Metal
import Foundation

func main() throws {
    guard let device = MTLCreateSystemDefaultDevice() else {
        fatalError("No Metal-capable GPU")
    }
    let lib = try device.makeLibrary(URL: URL(fileURLWithPath: "hello.metallib"))
    let fn = lib.makeFunction(name: "hello")!
    let pipeline = try device.makeComputePipelineState(function: fn)

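    // threadsPerTG must match the constant hard-coded in hello.metal.
    let threadsPerTG = 4
    let threadgroups = 1
    let total = threadsPerTG * threadgroups

    // One uint2 slot per thread, in memory both the CPU and the GPU can access.
    let outBuf = device.makeBuffer(
        length: total * MemoryLayout<SIMD2<UInt32>>.stride,
        options: .storageModeShared)!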

    let queue = device.makeCommandQueue()!
    let cb = queue.makeCommandBuffer()!
    let enc = cb.makeComputeCommandEncoder()!
    enc.setComputePipelineState(pipeline)
    enc.setBuffer(outBuf, offset: 0, index: 0)
    enc.dispatchThreadgroups(
        MTLSize(width: threadgroups, height: 1, depth: 1),
        threadsPerThreadgroup: MTLSize(width: threadsPerTG, height: 1, depth: 1))
    enc.endEncoding()
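    // Submit the work and block until the GPU has finished.
    cb.commit(); cb.waitUntilCompleted()

    // Read the results back on the CPU, in buffer-index order.
    let p = outBuf.contents().bindMemory(to: SIMD2<UInt32>.self, capacity: total)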
    for i in 0..<total {
        print("Hello from threadgroup \(p[i].x), thread \(p[i].y)")
    }
}

do { try main() } catch { print("Error: \(error)"); exit(1) }

Build + run·bash

mkdir -p build
xcrun -sdk macosx metal -c hello.metal -o build/hello.air
xcrun -sdk macosx metallib build/hello.air -o build/hello.metallib
xcrun -sdk macosx swiftc hello_host.swift \
    -framework Metal -framework Foundation \
    -o build/hello_host

cd build && ./hello_host
# Hello from threadgroup 0, thread 0
# Hello from threadgroup 0, thread 1
# Hello from threadgroup 0, thread 2
# Hello from threadgroup 0, thread 3

Exercise

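Save the two files, run the three build commands, and check that four Hello lines print. Then change threadgroups = 1 to threadgroups = 2 in hello_host.swift, and change threadsPerTG to 8 in both hello_host.swift and hello.metal (yes, both: they have to match). Rebuild and run. You should get 16 lines. They come out in buffer-index order because the CPU prints them by index, even though the GPU threads themselves ran in no guaranteed order. The mismatch hint is deliberate: in Metal, the kernel's hard-coded threadsPerTG has to match the host's threadsPerThreadgroup dispatch argument, and forgetting to update the kernel side is one of the top three Metal bugs.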
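
To see what a mismatch does without touching the GPU, here is a minimal CPU-only sketch of the kernel's index arithmetic. It is a model, not Metal code: `simulateDispatch` and `report` are hypothetical helpers made up for this illustration, and the sketch assumes the host sizes the buffer and launches threads from its own threadsPerTG, as hello_host.swift does.

```swift
// CPU-only model of the kernel's index arithmetic; no GPU or Metal import needed.
// `kernelThreadsPerTG` stands in for the constant hard-coded in hello.metal,
// `hostThreadsPerTG` for the width the host passes to dispatchThreadgroups.
// Both helper functions are hypothetical and exist only for this illustration.
func simulateDispatch(kernelThreadsPerTG: Int,
                      hostThreadsPerTG: Int,
                      threadgroups: Int) -> (written: [Int], capacity: Int) {
    // The host sizes the buffer from its own idea of the threadgroup width.
    let capacity = hostThreadsPerTG * threadgroups
    var written: [Int] = []
    for tg in 0..<threadgroups {
        // The host also decides how many threads run in each threadgroup.
        for tid in 0..<hostThreadsPerTG {
            // Same formula as the kernel: gid = tg_id_in_g * threadsPerTG + tid_in_tg
            written.append(tg * kernelThreadsPerTG + tid)
        }
    }
    return (written, capacity)
}

func report(_ label: String, _ r: (written: [Int], capacity: Int)) {
    // Indices past the end of the buffer, and slots no thread ever wrote.
    let outOfBounds = r.written.filter { $0 >= r.capacity }
    let covered = Set(r.written.filter { $0 < r.capacity })
    let untouched = (0..<r.capacity).filter { !covered.contains($0) }
    print("\(label): capacity \(r.capacity), out of bounds \(outOfBounds), never written \(untouched)")
}

// Both sides agree on 8 threads per threadgroup.
let matched = simulateDispatch(kernelThreadsPerTG: 8, hostThreadsPerTG: 8, threadgroups: 2)
// The kernel was bumped to 8 but the host still dispatches 4.
let mismatched = simulateDispatch(kernelThreadsPerTG: 8, hostThreadsPerTG: 4, threadgroups: 2)

report("matched", matched)
report("mismatched", mismatched)
```

In the matched case every index from 0 to 15 is written exactly once. In the mismatched case the second threadgroup computes indices 8 through 11 against an 8-element buffer, so those writes land out of bounds while slots 4 through 7 are never written. One way to sidestep the problem, not used in this lesson, is to have Metal supply the width through the `[[threads_per_threadgroup]]` kernel attribute instead of hard-coding it.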
