Mixed Precision

Why and how (recap)

Modern accelerators run BF16/FP16 multiplies 2-8x faster than FP32. Mixed precision keeps the model weights in FP32 (the master copy) but runs the heavy compute (matmul) in lower precision. The result: faster training with the same final accuracy in 99% of cases.
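To see why the master copy stays in FP32, here is a minimal sketch using only the standard library. It simulates 1,000 optimizer steps with a hypothetical update of 1e-4 on a weight of 1.0, rounding the stored weight to FP16 or FP32 after every step (`struct`'s `e` format is IEEE half precision). The numbers are illustrative, not taken from a real model.

```python
import struct


def round_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def round_fp32(x: float) -> float:
    """Round a Python float to the nearest IEEE single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


update = 1e-4          # hypothetical lr * grad, assumed constant for the demo
w16 = w32 = 1.0        # same starting weight, stored in two precisions

for _ in range(1000):
    # FP16 values near 1.0 are spaced about 9.8e-4 apart, so +1e-4 rounds away.
    w16 = round_fp16(w16 + update)
    w32 = round_fp32(w32 + update)

print("FP16-stored weight:", w16)   # never moves: 1.0
print("FP32-stored weight:", w32)   # close to 1.1
```

Small updates that vanish when the weight itself is stored in low precision are the reason the optimizer works on the FP32 copy, while only the matmuls drop to 16 bits.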

BF16: same dynamic range as FP32, less precision, no GradScaler needed, the preferred default in 2026. FP16: more precision, smaller dynamic range, needs a GradScaler so that small gradients do not underflow.
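The range-versus-precision trade-off can be checked without a GPU. The sketch below is an illustration under stated assumptions: `to_fp16` and `to_bf16` are hypothetical helpers that emulate the two formats with the standard library (BF16 is taken as the top 16 bits of the FP32 encoding, with round-to-nearest-even; NaN inputs are not handled). The scale factor 2**16 matches GradScaler's default initial scale.

```python
import struct


def to_fp16(x: float) -> float:
    """Round x to IEEE half precision; return +/-inf when it overflows."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        return float("inf") if x > 0 else float("-inf")


def to_bf16(x: float) -> float:
    """Round x to bfloat16 by keeping the top 16 bits of its FP32 encoding."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Round to nearest even on the 16 bits that are about to be dropped.
    bits = (bits + 0x7FFF + ((bits >> 16) & 1)) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


tiny_grad = 1e-8
print(to_fp16(tiny_grad))        # 0.0: below FP16's smallest subnormal
print(to_bf16(tiny_grad) > 0)    # True: BF16 shares FP32's exponent range

print(to_fp16(70000.0))          # inf: FP16 tops out at 65504
print(to_bf16(70000.0))          # 70144.0: in range, but coarsely rounded

fine = 1 + 2**-9
print(to_fp16(fine) == fine)     # True: FP16 has 10 mantissa bits
print(to_bf16(fine))             # 1.0: BF16 has only 7 mantissa bits

# Loss scaling in one line: scale up before rounding to FP16, unscale afterwards.
scale = 2**16
recovered = to_fp16(tiny_grad * scale) / scale
print(recovered)                 # roughly 1e-8 again
```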

Tip: If your hardware supports BF16 (H100, A100, M3+), default to BF16 and skip the GradScaler complexity. Fall back to FP16 only when you are stuck on hardware without native BF16 (consumer GPUs older than the RTX 3000 series).

What goes wrong when it fails

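NaN loss. The most common cause is a single value leaving FP16's range somewhere in the forward pass, either underflowing to zero or overflowing past 65504 to inf, and then propagating through the loss. The usual fix is switching to BF16, or adding a GradScaler if you are stuck on FP16.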

Subtle accuracy degradation. Some operations (log-softmax, layer norm, and so on) are sensitive to lower precision. PyTorch autocast keeps these in FP32 automatically, but custom layers may not get the same treatment. If model accuracy drops noticeably under autocast, profile which ops are running in low precision.
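One lightweight way to do that audit is to record each module's output dtype during a single forward pass under autocast. The sketch below is an illustration, not a full profiler: `audit_dtypes` and `FP32RMSNorm` are hypothetical names, and the custom layer shows one way a hand-written op can opt out of autocast and compute in FP32. It runs on CPU with BF16 autocast so it needs no GPU.

```python
import torch
from torch import nn


class FP32RMSNorm(nn.Module):
    """Hypothetical custom layer that forces its reduction to run in FP32."""

    def __init__(self, eps: float = 1e-6):
        super().__init__()
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Disable autocast locally and upcast, so the mean of squares is FP32.
        with torch.autocast(device_type=x.device.type, enabled=False):
            x32 = x.float()
            return x32 * torch.rsqrt(x32.pow(2).mean(dim=-1, keepdim=True) + self.eps)


def audit_dtypes(model, example, device_type, dtype):
    """Run one forward pass under autocast; return {leaf module name: output dtype}."""
    seen, hooks = {}, []
    for name, module in model.named_modules():
        if len(list(module.children())) == 0:  # leaf modules only
            hooks.append(module.register_forward_hook(
                lambda mod, inp, out, name=name: seen.__setitem__(name, out.dtype)))
    with torch.no_grad(), torch.autocast(device_type=device_type, dtype=dtype):
        model(example)
    for h in hooks:
        h.remove()
    return seen


model = nn.Sequential(nn.Linear(16, 32), FP32RMSNorm(), nn.Linear(32, 4))
report = audit_dtypes(model, torch.randn(8, 16), "cpu", torch.bfloat16)
for name, dt in report.items():
    print(name, dt)
```

Any precision-sensitive custom layer that shows up in the report with a 16-bit output dtype is a candidate for the same local opt-out.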

Numbers

BF16 training: 1.5-3x faster wall-clock than FP32 on Tensor Cores, and roughly 50% less memory for activations. FP8 (H100 and newer) can add another 30-50% for transformer training, though the tooling is still maturing as of 2026.
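The ~50% activation saving is plain byte arithmetic: FP32 stores 4 bytes per element, BF16 stores 2. A quick sketch with a hypothetical activation tensor of shape (batch=32, seq_len=2048, hidden=4096); the shape is made up for illustration.

```python
def activation_gib(batch: int, seq_len: int, hidden: int, bytes_per_elem: int) -> float:
    """Size of one (batch, seq_len, hidden) activation tensor in GiB."""
    return batch * seq_len * hidden * bytes_per_elem / 2**30


fp32_gib = activation_gib(32, 2048, 4096, bytes_per_elem=4)
bf16_gib = activation_gib(32, 2048, 4096, bytes_per_elem=2)
saving = 1 - bf16_gib / fp32_gib

print(fp32_gib)   # 1.0 GiB in FP32
print(bf16_gib)   # 0.5 GiB in BF16
print(saving)     # 0.5, i.e. 50% saved for this tensor
```

The saving applies to tensors stored in 16 bits; the FP32 master weights and optimizer state keep their full size.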

Principle: Mixed precision is essentially free on modern hardware. If you are not using it, you are paying up to 2x in time and memory for no good reason.

Code

BF16 mixed precision (no scaler needed)·python

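import torch
from torch.amp import autocast

# Assumes `model`, `loss_fn`, and `train_loader` are already defined.
device = torch.device("cuda")
model.to(device)
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)

for xb, yb in train_loader:
    xb, yb = xb.to(device), yb.to(device)
    opt.zero_grad()
    # Forward pass and loss run under autocast: matmuls use BF16, weights stay FP32.
    with autocast(device_type="cuda", dtype=torch.bfloat16):
        logits = model(xb)
        loss = loss_fn(logits, yb)
    # Backward and the optimizer step run outside autocast; BF16 needs no GradScaler.
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    opt.step()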
