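torch.compile(): graph mode without giving up eager

The single biggest feature of PyTorch 2.x

torch.compile(model) is a one-line transformation that JIT-compiles your model into optimized, fused kernels. Typical speedups on CUDA are 1.5–3x, and often more for transformer-shaped models, though actual gains depend on the model and hardware. The surprising part: outside the compiled region, you keep eager-mode debuggability.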

What actually happens

TorchDynamo intercepts the Python bytecode of your forward and captures its ops into an FX graph.
AOTAutograd rewrites the graph to include the backward ops.
TorchInductor lowers the graph to optimized Triton/CUDA kernels (C++ on CPU).
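
The first of these stages can be observed directly. The sketch below passes a custom backend to torch.compile so that the FX graph Dynamo captured can be printed. The names inspect_backend, captured, and f are hypothetical and exist only for illustration. Because the custom backend replaces the default one, AOTAutograd and TorchInductor do not run here; the graph is simply executed as captured.

```python
import torch
import torch.fx

captured = []  # FX GraphModules that Dynamo hands to the backend


def inspect_backend(gm: torch.fx.GraphModule, example_inputs):
    """Hypothetical backend: record the captured graph, then run it unoptimized."""
    captured.append(gm)
    return gm.forward  # a real backend would return a compiled callable here


def f(x):
    return torch.relu(x) + 1


compiled_f = torch.compile(f, backend=inspect_backend)

x = torch.randn(4)
out = compiled_f(x)       # the first call triggers the capture

print(captured[0].graph)  # the ops Dynamo traced from f's bytecode
```

A backend is just a callable that receives the GraphModule plus example inputs and returns something callable, which makes it a convenient hook for inspecting what was captured.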

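The graph is built lazily on the first call. If your code does something Dynamo cannot capture (data-dependent control flow, custom Python objects inside forward), it falls back gracefully to eager for that section. This is called a 'graph break'. Compilation does not fail outright; you only lose some of the speedup.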
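
Both behaviors, lazy compilation and graceful fallback, can be sketched with a backend that only counts the graphs it receives. The names counting_backend, compiled_graphs, and f are hypothetical. The exact number of graphs may differ between PyTorch versions, so the sketch prints the counts rather than assuming them.

```python
import torch

compiled_graphs = []  # one entry per graph Dynamo sends to the backend


def counting_backend(gm, example_inputs):
    """Hypothetical backend: count captured graphs and run them as-is."""
    compiled_graphs.append(gm)
    return gm.forward


def f(x):
    y = x * 2
    if y.sum() > 0:      # branch depends on a tensor value: graph break
        return y + 1
    return y - 1


compiled_f = torch.compile(f, backend=counting_backend)
graphs_before_call = len(compiled_graphs)  # wrapping alone compiles nothing

x = torch.ones(4)
out = compiled_f(x)                        # capture happens here
graphs_after_call = len(compiled_graphs)

print(graphs_before_call, graphs_after_call)
```

The call still returns the same result as eager execution; the function is just split into more than one captured graph around the branch.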

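Three modes

default: a good balance between compile time and speedup.
mode="reduce-overhead": minimizes Python overhead between kernel launches. Best for small models or small batch sizes.
mode="max-autotune": searches kernel variants exhaustively for maximum speed. Compilation is slow (sometimes minutes); runtime is the fastest.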

What breaks the graph

Data-dependent control flow on tensor values (if x.sum() > 0). Branching on a Python int or a config value is usually fine.
Calls into untraced libraries (some OpenCV / PIL ops inside forward).
Mutating tensor values through Python attributes.
Some custom autograd Functions (support is improving across PyTorch 2.x releases).
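
One way to check whether a function contains any of these is to compile it with fullgraph=True, which turns a graph break into an error instead of a silent fallback. The sketch below contrasts a data-dependent branch with a torch.where rewrite. The names counting_backend, branchy, and branchless are hypothetical, and the exception type raised differs between PyTorch versions, so the sketch catches a generic Exception.

```python
import torch

graphs = []  # graphs received by the backend


def counting_backend(gm, example_inputs):
    """Hypothetical backend: count captured graphs and run them as-is."""
    graphs.append(gm)
    return gm.forward


def branchy(x):
    if x.sum() > 0:          # Python branch on a tensor value
        return x * 2
    return x * -1


def branchless(x):
    # Same logic, but the condition stays a tensor inside the graph.
    return torch.where(x.sum() > 0, x * 2, x * -1)


strict_branchy = torch.compile(branchy, backend=counting_backend, fullgraph=True)
strict_branchless = torch.compile(branchless, backend=counting_backend, fullgraph=True)

x = torch.randn(8)

try:
    strict_branchy(x)
    branchy_failed = False
except Exception as err:     # exact exception class varies by version
    branchy_failed = True
    print("fullgraph=True rejected branchy:", type(err).__name__)

graphs.clear()
out = strict_branchless(x)   # captured as a single graph
print(len(graphs))
```

Without fullgraph=True, branchy would still run correctly; the flag is useful mainly as a check that a hot path really compiles as one graph.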

Code

Compiling a model: one line·python

import torch
import torch.nn as nn

class MLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(1024, 4096), nn.GELU(),
            nn.Linear(4096, 4096), nn.GELU(),
            nn.Linear(4096, 1024),
        )
    def forward(self, x):
        return self.layers(x)

model = MLP().cuda()
model = torch.compile(model)               # done.

x = torch.randn(64, 1024, device='cuda')
y = model(x)                                # first call: slow (compiles)
y = model(x)                                # subsequent: fast

Choosing a mode·python

import torch

# Default — best for most workloads
model = torch.compile(model)

# Reduce overhead — when batch is small / model is light
model = torch.compile(model, mode="reduce-overhead")

# Max autotune — exhaustive search; slow compile, fastest runtime
model = torch.compile(model, mode="max-autotune")

Detecting graph breaks: diagnostic env var·python

import os

# Set TORCH_LOGS before importing torch so the logging config is picked up
os.environ['TORCH_LOGS'] = 'graph_breaks'

import torch

@torch.compile
def forward(x, mask):
    # This .item() call CAUSES a graph break — the value goes to Python
    if mask.sum().item() > 0:
        return x * 2
    return x * -1

x = torch.randn(8)
mask = torch.tensor([1, 0, 1, 0, 1, 0, 1, 0])
forward(x, mask)
# The log reports the graph break and where it happened; exact wording varies by version

# Fix: keep the condition as a tensor and select the result with torch.where
@torch.compile
def forward_fixed(x, mask):
    return torch.where(mask.sum() > 0, x * 2, x * -1)

Compile inside training: drop in·python

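import torch, torch.nn as nn
from torch.amp import autocast

# MyModel, loader, and num_epochs are placeholders for your own model,
# DataLoader, and epoch count.
model = MyModel().cuda()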
model = torch.compile(model, mode="default")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

for epoch in range(num_epochs):
    for x, y in loader:
        x, y = x.cuda(non_blocking=True), y.cuda(non_blocking=True)
        optimizer.zero_grad(set_to_none=True)
        with autocast(device_type='cuda', dtype=torch.bfloat16):
            out = model(x)
            loss = criterion(out, y)
        loss.backward()
        optimizer.step()

# torch.compile composes cleanly with bf16 autocast, and with DDP/FSDP.
