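Profiling: finding the real bottleneck

Don't optimize what you haven't measured.

The PyTorch profiler captures op-level timing on both CPU and GPU, exports Chrome-trace JSON, and (with profile_memory=True) tracks tensor allocations. It is the right tool for answering "where does my training time actually go?" and "why am I hitting OOM?".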

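Two profilers, two purposes

torch.profiler.profile: modern and comprehensive. CPU + CUDA timing, memory tracking, Chrome trace export. The right default.
torch.utils.bottleneck: older and lighter. It is a command-line tool (python -m torch.utils.bottleneck your_script.py) that summarizes a script run with Python's cProfile and the autograd profiler. Useful for a quick "what's slow?" answer without setting up the full profiler.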

Chrome trace

The profiler can export JSON that loads in Chrome's chrome://tracing viewer (or perfetto.dev). You get a timeline of every CPU and GPU op, with durations and call relationships. Problems like "there's a giant cudaStreamSynchronize" or "this tiny op is dispatched 10,000 times" become visually obvious.
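
The "tiny op dispatched 10,000 times" pattern can also be checked numerically. Below is a minimal sketch that aggregates a trace by op name, assuming the usual Chrome trace layout: a top-level "traceEvents" list in which complete events have "ph" equal to "X" and a "dur" in microseconds. The helper names load_trace and summarize_trace, and the sample op names and durations, are hypothetical and exist only for illustration.

```python
import json
from collections import defaultdict


def load_trace(path: str) -> dict:
    """Load an exported Chrome trace JSON file from disk."""
    with open(path) as f:
        return json.load(f)


def summarize_trace(trace: dict, top: int = 5) -> list:
    """Return (op name, call count, total ms) rows, largest total first."""
    counts = defaultdict(int)
    total_us = defaultdict(float)
    for event in trace.get("traceEvents", []):
        # "X" marks a complete event, which carries a duration in microseconds.
        if event.get("ph") != "X":
            continue
        name = event.get("name", "<unnamed>")
        counts[name] += 1
        total_us[name] += event.get("dur", 0)
    rows = [(name, counts[name], total_us[name] / 1000.0) for name in counts]
    rows.sort(key=lambda row: row[2], reverse=True)
    return rows[:top]


# Hand-made stand-in for a real export (made-up op names and durations).
sample_trace = {
    "traceEvents": [
        {"ph": "X", "name": "aten::mm", "ts": 0, "dur": 4000},
        {"ph": "X", "name": "aten::add", "ts": 10, "dur": 5},
        {"ph": "X", "name": "aten::add", "ts": 20, "dur": 5},
        {"ph": "M", "name": "process_name"},  # metadata event, skipped
    ]
}

summary = summarize_trace(sample_trace)
for name, count, total_ms in summary:
    print(f"{name:20s} calls={count:6d} total={total_ms:8.3f} ms")
```

For a real run, summarize_trace(load_trace("/tmp/trace.json")) would apply the same aggregation to an exported file. A high call count paired with a small total is the dispatch-overhead signature to look for.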

Reproducibility: orthogonal, but worth knowing

If you want bit-for-bit deterministic runs (for debugging or paper-style ablations), set every RNG seed AND tell PyTorch to use deterministic algorithms. This is not free: some ops have no deterministic implementation, and others run slower in deterministic mode. Use it for debugging, not for production training.
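
A quick way to confirm that seeding behaves as expected is to run the same computation twice and compare the results exactly. The sketch below does this on CPU with a tiny placeholder model; run_once and the layer sizes are hypothetical, and a real check would run a few actual training steps on the target device instead.

```python
import torch


def run_once(seed: int) -> torch.Tensor:
    """Seed, build a tiny model, and return its output on random input."""
    torch.manual_seed(seed)
    model = torch.nn.Linear(8, 4)   # weights are drawn from the seeded RNG
    x = torch.randn(16, 8)          # so is the input
    return model(x).detach()


a = run_once(0)
b = run_once(0)
c = run_once(1)

# torch.equal is an exact comparison, not an approximate one.
same_seed_matches = torch.equal(a, b)
different_seed_matches = torch.equal(a, c)
print("same seed identical:     ", same_seed_matches)
print("different seed identical:", different_seed_matches)
```

Using torch.equal rather than torch.allclose is deliberate: the goal is bit-for-bit agreement, so a tolerance would hide exactly the drift this check is meant to catch.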

Code

torch.profiler: basic CPU+GPU profile·python

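import torch
from torch.profiler import profile, record_function, ProfilerActivity

# Placeholder model and input so the snippet runs on its own; swap in your own
device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Linear(512, 512).to(device)
input_data = torch.randn(64, 512, device=device)

# Only request CUDA activity when a GPU is actually present
activities = [ProfilerActivity.CPU]
if device == "cuda":
    activities.append(ProfilerActivity.CUDA)

with profile(
    activities=activities,
    record_shapes=True,      # record input shapes per op
    profile_memory=True,     # track tensor allocations
    with_stack=True,         # record Python source locations (adds overhead)
) as prof:
    with record_function("model_inference"):   # custom label shown in the trace
        out = model(input_data)

# Top ops by GPU time (falls back to CPU time when there is no GPU)
sort_key = "cuda_time_total" if device == "cuda" else "cpu_time_total"
print(prof.key_averages().table(sort_by=sort_key, row_limit=10))

# Export Chrome trace; open in chrome://tracing or perfetto.dev
prof.export_chrome_trace("/tmp/trace.json")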

Stepped profiler over a training loop·python

import torch
from torch.profiler import profile, schedule, tensorboard_trace_handler, ProfilerActivity

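# Assumes model, criterion, optimizer, and loader are already defined
# schedule(wait, warmup, active, repeat): skip 2 steps, warm up for 2,
# record 4, and do that cycle once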
sched = schedule(wait=2, warmup=2, active=4, repeat=1)

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    schedule=sched,
    on_trace_ready=tensorboard_trace_handler('/tmp/profile'),
) as prof:
    for step, (x, y) in enumerate(loader):
        if step >= 10: break
        out = model(x.cuda())
        loss = criterion(out, y.cuda())
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        prof.step()                 # tells the profiler we crossed a step boundary
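
To see which steps a schedule actually records, the schedule object can be called directly with a step index. This is a small sketch using the same arguments as above; it relies on the callable returned by torch.profiler.schedule mapping a step number to a profiler action, and it needs no GPU.

```python
from torch.profiler import schedule

# Same arguments as the training-loop example above.
sched = schedule(wait=2, warmup=2, active=4, repeat=1)

# Ask the schedule what the profiler would do at each of the first 10 steps.
actions = [sched(step).name for step in range(10)]
for step, action in enumerate(actions):
    print(f"step {step}: {action}")
```

With these settings, steps 0-1 should come out as NONE, 2-3 as WARMUP, 4-6 as RECORD, 7 as RECORD_AND_SAVE (the point where on_trace_ready fires), and anything after that as NONE because repeat=1. That is why the loop above stops at step 10: nothing more is recorded once the single cycle ends.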

Memory snapshot: finding leaks·python

import torch

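# Begin recording allocator history (requires a CUDA device).
# The underscore-prefixed functions are private APIs and may change between releases.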
torch.cuda.memory._record_memory_history(max_entries=100_000)

# ... run your training for a while ...

# Dump a snapshot — visualize at https://pytorch.org/memory_viz
torch.cuda.memory._dump_snapshot('/tmp/memory.pickle')

# Quick numeric summary
print(torch.cuda.memory_summary(abbreviated=True))
print(f"allocated: {torch.cuda.memory_allocated()/1e9:.2f} GB")
print(f"reserved : {torch.cuda.memory_reserved()/1e9:.2f} GB")

Reproducibility: turning the dial all the way·python

import os, random, torch
import numpy as np

def seed_everything(seed=42):
    # cuBLAS reads this when it initializes, so set it before any CUDA work
    os.environ['CUBLAS_WORKSPACE_CONFIG'] = ':4096:8'
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # Force deterministic behavior even at a perf cost
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    # warn_only=True warns instead of raising when an op has no deterministic
    # implementation; drop it to get a hard error instead
    torch.use_deterministic_algorithms(True, warn_only=True)

seed_everything(42)

Exercise

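Run torch.profiler on one training step of any model. Open the resulting Chrome trace in perfetto.dev. Find the longest GPU op and the longest CPU op. Try one optimization (compile? a bigger batch? num_workers?) and re-profile. Document the before/after numbers; that habit is the difference between optimization and superstition.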