Mixed Precision and Reproducibility

What is mixed precision?

Modern GPUs run float16 (FP16) or bfloat16 (BF16) operations 2-8x faster than float32 (FP32). Mixed precision does three things: it casts the forward pass to FP16/BF16, keeps a master copy of the weights in FP32, and uses GradScaler to cope with the small dynamic range of FP16 gradients. The result: faster training, less VRAM, and usually no accuracy hit.
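
To see why the FP32 master copy matters, here is a minimal sketch that uses only the Python standard library (no GPU or PyTorch needed). It rounds values through struct's half- and single-precision formats as a stand-in for FP16 and FP32 storage. The weight, learning rate, and gradient are made-up illustrative values, not numbers from a real model.

```python
import struct

def round_to(fmt: str, x: float) -> float:
    """Round x to the nearest value storable in the given struct float format."""
    return struct.unpack(fmt, struct.pack(fmt, x))[0]

def fp16(x: float) -> float:
    return round_to("<e", x)  # IEEE half precision

def fp32(x: float) -> float:
    return round_to("<f", x)  # IEEE single precision

lr, grad = 1e-3, 0.1   # illustrative values: each update is about 1e-4
w_fp16 = fp16(1.0)     # weight stored only in FP16
w_master = fp32(1.0)   # FP32 master copy of the same weight

for _ in range(100):
    update = lr * grad
    w_fp16 = fp16(w_fp16 - update)      # update is smaller than half an FP16 step near 1.0
    w_master = fp32(w_master - update)  # FP32 is fine enough to keep accumulating

print("FP16-only weight:   ", w_fp16)
print("FP32 master weight: ", w_master)
print("FP16 view of master:", fp16(w_master))
```

The FP16-only weight never moves because every update rounds away, while the master copy drifts toward 0.99 and can then be cast down for the next forward pass.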

BF16 has the same exponent range as FP32 (so it does not need GradScaler) but lower precision. FP16 is more precise but has a smaller range (so it needs GradScaler to avoid underflow). Both are fast on H100 / A100; on Apple Silicon, BF16 is usually the right pick.
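
The range-versus-precision trade-off can be checked without a GPU. The sketch below is an emulation, not PyTorch's own casting code: FP16 comes from struct's half-precision format, and BF16 is approximated by keeping the top 16 bits of the float32 encoding with round-to-nearest-even (NaN handling is ignored). The helper names to_fp16 and to_bf16 are hypothetical.

```python
import math
import struct

def to_fp16(x: float) -> float:
    """Round to the nearest IEEE half-precision value (inf on overflow)."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        # struct refuses values beyond the FP16 range; a cast would give inf.
        return math.copysign(math.inf, x)

def to_bf16(x: float) -> float:
    """Emulate bfloat16 by keeping the top 16 bits of the float32 encoding."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)  # round to nearest, ties to even
    bits &= 0xFFFF0000                   # drop the low 16 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]

# A tiny value, a value just above 1, and a value above the FP16 maximum.
for value in (1e-8, 1.001, 70000.0):
    print(f"{value!r:>8}  fp16={to_fp16(value)!r}  bf16={to_bf16(value)!r}")
```

The tiny and the large value survive in BF16 but not in FP16 (range), while the value just above 1 keeps its fraction in FP16 but collapses to 1.0 in BF16 (precision).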

Tip: If you train on a modern accelerator without mixed precision, you are paying roughly 2x in time and VRAM for no good reason. torch.autocast + GradScaler is nearly a one-line upgrade.

The standard recipe

Wrap the forward pass and the loss in torch.autocast(device_type="cuda", dtype=torch.bfloat16). For FP16, add GradScaler as well: scaler.scale(loss).backward(), scaler.unscale_(opt), then clip, then scaler.step(opt), then scaler.update().
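
The order of those calls matters, and the arithmetic behind it can be sketched in plain Python. This is an illustration of the idea only, not how GradScaler is implemented: the scale factor, the gradient values, and the helper clip_by_norm are assumptions made for the example.

```python
import math
import struct

def to_fp16(x: float) -> float:
    """Round to the nearest IEEE half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def clip_by_norm(grads, max_norm):
    """Scale grads down so that their L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    factor = min(1.0, max_norm / norm)
    return [g * factor for g in grads]

scale = 65536.0  # illustrative power-of-two loss scale

# 1) Scaling keeps a tiny gradient alive through FP16 storage.
true_grad = 1e-8
lost = to_fp16(true_grad)                        # underflows to zero
recovered = to_fp16(true_grad * scale) / scale   # unscaled in higher precision

# 2) Clipping has to see unscaled gradients.
grads = [0.3, 0.4]                               # true L2 norm 0.5, below max_norm
scaled = [g * scale for g in grads]
right = clip_by_norm([g / scale for g in scaled], max_norm=1.0)  # unscale, then clip
wrong = [g / scale for g in clip_by_norm(scaled, max_norm=1.0)]  # clip, then unscale

print("without scaling:", lost, " with scaling:", recovered)
print("unscale then clip:", right)
print("clip then unscale:", wrong)
```

Clipping the still-scaled gradients shrinks them by the scale factor, which is why scaler.unscale_(opt) comes before clip_grad_norm_ in the recipe.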

Reproducibility caveats

Mixed precision adds a small amount of non-determinism even when seeds are set, because FP16/BF16 rounding patterns depend on operation order. Plan for statistical reproducibility (the final metric within ± 0.1%), not bitwise reproducibility. Applications that need exact determinism should run in FP32 and accept the cost.
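
The order dependence is easy to reproduce by hand. The sketch below sums the same five numbers in two orders, rounding the running total to FP16 after every addition. It is a toy stand-in for a parallel reduction whose summation order can change between runs; the input values are chosen to make the effect obvious.

```python
import struct

def to_fp16(x: float) -> float:
    """Round to the nearest IEEE half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def fp16_sum(values):
    """Sum left to right, rounding the running total to FP16 after every add."""
    total = 0.0
    for v in values:
        total = to_fp16(total + v)
    return total

values = [2048.0, 0.5, 0.5, 0.5, 0.5]     # the exact sum is 2050

large_first = fp16_sum(values)            # each 0.5 is rounded away against 2048
small_first = fp16_sum(reversed(values))  # the 0.5s add up to 2.0 before meeting 2048

print("large value first: ", large_first)
print("small values first:", small_first)
```

Same inputs, same seed, different result: near 2048 the FP16 grid has a spacing of 2, so whether the small terms survive depends only on when they are added.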

Principle: On modern hardware, mixed precision is essentially free speed with no accuracy cost. The valid reasons to skip it are: a pre-Volta GPU (no Tensor Cores), or an application where bitwise determinism is a hard requirement.
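
That decision rule can be written down as a small helper. This is a sketch under assumptions: pick_amp_dtype is a hypothetical function, and it relies on the NVIDIA compute-capability convention that Volta starts at 7.0 and Ampere (the first generation with BF16 Tensor Cores) at 8.0. In PyTorch, the capability tuple would typically come from torch.cuda.get_device_capability().

```python
from typing import Optional, Tuple

def pick_amp_dtype(capability: Tuple[int, int],
                   need_bitwise: bool = False) -> Optional[str]:
    """Hypothetical helper: choose an AMP dtype name, or None to stay in FP32."""
    major, _minor = capability
    if need_bitwise:
        return None        # bitwise determinism is a hard requirement: stay in FP32
    if major < 7:
        return None        # pre-Volta: no Tensor Cores, nothing to gain
    if major < 8:
        return "float16"   # Volta / Turing: FP16 Tensor Cores, pair with GradScaler
    return "bfloat16"      # Ampere and newer: BF16 Tensor Cores, no GradScaler needed

# (major, minor) compute capabilities of a few common GPUs.
gpus = [("P100", (6, 0)), ("V100", (7, 0)), ("T4", (7, 5)),
        ("A100", (8, 0)), ("H100", (9, 0))]
for name, cap in gpus:
    print(f"{name}: {pick_amp_dtype(cap)}")
```

Returning a dtype name rather than a torch.dtype keeps the sketch free of dependencies; in real code the result would map to torch.float16 or torch.bfloat16.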

Code

Mixed-precision training in 2026 PyTorch (Python)

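import torch

# Assumes model, loss_fn, train_loader and device ("cuda") are defined elsewhere.
amp_dtype = torch.bfloat16  # switch to torch.float16 on GPUs without BF16 support

# FP16 needs GradScaler; BF16 doesn't, so the scaler is disabled (a no-op) for BF16.
scaler = torch.amp.GradScaler("cuda", enabled=(amp_dtype == torch.float16))
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)

for xb, yb in train_loader:
    xb, yb = xb.to(device), yb.to(device)
    opt.zero_grad()

    # Run the forward pass and the loss in reduced precision.
    with torch.autocast(device_type="cuda", dtype=amp_dtype):
        logits = model(xb)
        loss = loss_fn(logits, yb)

    scaler.scale(loss).backward()  # backward pass on the scaled loss
    scaler.unscale_(opt)           # restore true gradient magnitudes before clipping
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    scaler.step(opt)               # skips the step if gradients contain inf/NaN
    scaler.update()                # adjust the scale factor for the next iteration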
