Admin cheat sheet

Knobs that actually move the needle in production

Knob	What to do	Why
Datatype	Prefer FP16 / BF16 weights; keep FP32 accumulation only where stability requires it	Hits the matrix-unit / Tensor Core path
Hidden size	Round to a multiple of 64 or 128	Fills the 128×128 block tile completely; avoids the padding fallback
Memory placement	Weights → private; activations → shared only when the CPU needs to read them	Keeps hot data in on-chip SRAM
Warm-up pass	Run one dummy inference after loading the weights	Caches the library's best-algorithm selection
Batch size	For inference, as high as memory allows	GEMV → GEMM; bandwidth-bound → compute-bound
Algorithm caching	Persist cuBLASLt / autotune selections across runs	Skips the 1–10 s cold-start algorithm search
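The batch-size row can be made concrete with a rough arithmetic-intensity estimate. The sketch below is a back-of-the-envelope model, not a measurement: the function name `gemm_arithmetic_intensity` is made up for illustration, and it assumes each matrix is read or written exactly once with 2-byte (FP16/BF16) elements.

```python
def gemm_arithmetic_intensity(m, n, k, bytes_per_elem=2):
    """FLOPs per byte for C[m,n] = A[m,k] @ B[k,n].

    Assumption: A, B and C each cross the memory bus exactly once
    (no cache reuse modeled), with bytes_per_elem bytes per element.
    """
    flops = 2 * m * n * k                      # one multiply + one add per term
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)
    return flops / bytes_moved


if __name__ == "__main__":
    hidden = 4096  # illustrative hidden size
    for batch in (1, 8, 64, 512):
        ai = gemm_arithmetic_intensity(batch, hidden, hidden)
        print(f"batch={batch:4d}  intensity={ai:8.2f} FLOP/byte")
```

At batch 1 the weight matrix dominates the traffic and the intensity stays below 1 FLOP/byte, which is the bandwidth-bound GEMV regime; the number grows roughly in proportion to the batch until the activations start to matter.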

Common face-plants + quick fixes

Symptom	Likely cause	Fix
GEMM < 40% F32 utilization	Hidden size is not a multiple of 16	Pad M or K to a multiple of 64 / 128
Throughput halves overnight	Forgot to re-enable the reduced-precision flag after a library upgrade	Turn it back on, rebuild, document the flag
Large gaps between command buffers	CPU tokenizer or I/O is stalling the queue	Pre-tokenize asynchronously; keep enough work in the queue
Library silently swaps to the GEMV path	Small m or k (e.g. head_dim = 32 for single-token decode)	Fuse heads or rewrite the KV-cache so the batch stays ≥ 8
Inference latency varies 5×	Library re-runs its heuristics for each shape	Cache the algorithm selection; sort requests by shape
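Two of the fixes in this table (padding to a tile multiple and sorting requests by shape) can be sketched in plain Python. The helper names `round_up` and `bucket_by_padded_length` are hypothetical, and the multiple of 64 is taken from the table rather than from any particular library's requirements.

```python
from collections import defaultdict


def round_up(value, multiple=64):
    """Smallest multiple of `multiple` that is >= value."""
    return ((value + multiple - 1) // multiple) * multiple


def bucket_by_padded_length(seq_lens, multiple=64):
    """Group request indices by their padded length.

    Requests in the same bucket present one GEMM shape to the library,
    so a cached algorithm selection can be reused across the bucket.
    """
    buckets = defaultdict(list)
    for idx, length in enumerate(seq_lens):
        buckets[round_up(length, multiple)].append(idx)
    return dict(sorted(buckets.items()))


if __name__ == "__main__":
    print(round_up(4099))                                 # 4160
    print(bucket_by_padded_length([10, 70, 64, 130, 5]))  # {64: [0, 2, 4], 128: [1], 192: [3]}
```

Coarser multiples mean fewer distinct shapes (and fewer heuristic runs) at the cost of more wasted padding, so the right value depends on the length distribution of the traffic.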

Conclusion

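GEMM is 90% of transformer math. Feed the library the shapes it loves (padded, batched, reduced-precision, hidden size on the right tile boundary), glance at the utilization counters when something slows down, and let BLAS do the heavy lifting. Hand-coded kernels are for learning (this quest!) and for shapes the library genuinely handles poorly, not the default.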
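When hardware counters are not at hand, a utilization figure can be approximated from a timed GEMM. This is a minimal sketch: `gemm_utilization` is a made-up helper, and `peak_flops` is an input you must take from your own device's datasheet (the value in the example is purely illustrative).

```python
def gemm_utilization(m, n, k, seconds, peak_flops):
    """Fraction of peak throughput achieved by one m x n x k GEMM.

    Uses the usual 2*m*n*k FLOP count for a matrix multiply.
    peak_flops is an assumption supplied by the caller (FLOP/s).
    """
    achieved = 2 * m * n * k / seconds
    return achieved / peak_flops


if __name__ == "__main__":
    # Hypothetical numbers: a 4096^3 GEMM timed at 1 ms on a device
    # whose datasheet peak is assumed to be 300 TFLOP/s.
    util = gemm_utilization(4096, 4096, 4096, seconds=1e-3, peak_flops=300e12)
    print(f"utilization = {util:.1%}")
```

Time the GEMM only after a warm-up pass and with the device synchronized, otherwise the measured time includes algorithm search and launch overhead.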

Code

Production checklist as a Python pre-flight script·python

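import torch

def preflight_gemm(model_hidden_size, batch_size, dtype):
    """Return a list of warnings for GEMM-unfriendly settings."""
    issues = []

    # Hidden sizes off the tile boundary fall back to a padded path.
    if model_hidden_size % 64 != 0:
        issues.append(
            f'hidden_size={model_hidden_size} is not a multiple of 64; '
            'expect a 20-40% perf loss versus a padded size')

    # FP32 on a Tensor Core GPU leaves the reduced-precision path unused.
    if dtype == torch.float32 and torch.cuda.is_available():
        cap = torch.cuda.get_device_capability()
        if cap >= (8, 0):  # Ampere or newer
            issues.append(
                f'FP32 on a Tensor-Core-capable GPU (cc {cap}); '
                'consider FP16/BF16 for ~2× throughput')

    # Very small batches tend to land on the bandwidth-bound GEMV path.
    if batch_size < 8:
        issues.append(
            f'batch_size={batch_size} → probably the GEMV path. '
            'Bandwidth-bound; pad the batch if the latency budget allows')

    return issues

# Example:
for issue in preflight_gemm(model_hidden_size=4099, batch_size=1, dtype=torch.float32):
    print('WARN:', issue)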

Caching cuBLASLt heuristic selections across runs·python

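import os
import torch

# Goal: avoid repeating the algorithm search (~5-10 s per cold start).
# Set these before the first CUDA matmul so cuBLASLt sees them at init.
# CUBLASLT_LOG_FILE only receives output when CUBLASLT_LOG_LEVEL is also set.
os.environ['CUBLASLT_LOG_FILE'] = '/tmp/cublaslt.log'
# Assumption: verify this variable against the cuBLAS docs for your
# version before relying on it; it may not be honored.
os.environ['CUBLASLT_HEURISTIC_CACHE_DIR'] = '/tmp/cublaslt_cache'

# Related PyTorch switches (both are in-process settings, not disk caches):
torch.backends.cuda.matmul.allow_tf32 = True  # TF32 Tensor Core matmuls for FP32 inputs
torch.backends.cudnn.benchmark = True         # cuDNN autotuning for convolutions

# The model's first warm-up pass then populates the cache; later calls
# reuse the selection instead of searching again.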
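If the library in use offers no on-disk persistence, the same idea can be approximated at the application level. The sketch below is a hypothetical illustration, not a cuBLASLt or PyTorch API: `ShapeAlgoCache`, its JSON file layout, and the `tune_fn` callback are all invented here, and the "algorithm id" is just whatever your own tuning step returns.

```python
import json
import os
import tempfile


class ShapeAlgoCache:
    """Hypothetical on-disk map from GEMM shape to a chosen algorithm id."""

    def __init__(self, path):
        self.path = path
        self.table = {}
        if os.path.exists(path):
            with open(path) as f:
                self.table = json.load(f)

    @staticmethod
    def _key(m, n, k, dtype):
        return f"{m}x{n}x{k}:{dtype}"

    def get_or_tune(self, m, n, k, dtype, tune_fn):
        """Return the cached choice, running tune_fn only on a miss."""
        key = self._key(m, n, k, dtype)
        if key not in self.table:
            self.table[key] = tune_fn(m, n, k, dtype)
            self._save()
        return self.table[key]

    def _save(self):
        # Write to a temp file, then rename, so a crash cannot leave
        # a half-written cache behind.
        tmp = self.path + ".tmp"
        with open(tmp, "w") as f:
            json.dump(self.table, f)
        os.replace(tmp, self.path)


if __name__ == "__main__":
    def slow_search(m, n, k, dtype):
        print("searching", m, n, k, dtype)  # stands in for a real autotune
        return 0

    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "gemm_algos.json")
        ShapeAlgoCache(path).get_or_tune(8, 4096, 4096, "fp16", slow_search)
        # A fresh instance (a "new run") reads the file and skips the search.
        ShapeAlgoCache(path).get_or_tune(8, 4096, 4096, "fp16", slow_search)
```

A real cache key would also need the GPU model and library version, since a selection tuned on one combination is not necessarily the best choice on another.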

Exercise

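Pick any model you have actually deployed (or your favorite open-source model) and audit it against the production-knob table, item by item. Find one place where the choice of hidden_size, batch_size, or dtype leaves performance on the table. Fix it (or at least document it). One fix that improves a hot kernel by 20-40% is worth more than weeks of micro-tuning elsewhere, and you now have the vocabulary to explain why to your team.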
