Gradient Clipping

One line of insurance

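Gradient clipping caps the global norm (or the per-element values) of the gradients before the optimizer step. If the unclipped norm exceeds max_norm, all gradients are scaled down by one shared factor so that their norm equals max_norm. The optimizer then sees a bounded gradient and cannot take a wild step.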
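The scaling rule can be written out in a few lines of plain Python. This is only a sketch of the arithmetic, not the library's implementation: the function name clip_by_global_norm, the list-of-lists layout, and the eps term are assumptions made for illustration.

```python
import math

def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Scale all gradients by one shared factor so the global L2 norm is at most max_norm.

    grads: list of lists of floats, one inner list per parameter (hypothetical layout).
    Returns (clipped_grads, norm_before_clipping).
    """
    total_norm = math.sqrt(sum(g * g for grad in grads for g in grad))
    # Only ever scale down; gradients already inside the bound are left alone.
    scale = min(1.0, max_norm / (total_norm + eps))
    clipped = [[g * scale for g in grad] for grad in grads]
    return clipped, total_norm

grads = [[3.0, 4.0], [12.0]]   # global norm = sqrt(9 + 16 + 144) = 13
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = math.sqrt(sum(g * g for grad in clipped for g in grad))
print(norm_before, norm_after)
```

The norm is taken over all parameters together, so every gradient shrinks by the same factor and the overall direction is unchanged.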

Standard defaults: max_norm=1.0 for transformers and RNNs, max_norm=0.5 for very unstable training, and no clipping for stable CNN training. The cost is a small amount of compute per step, so it is almost free.

Tip: If training starts producing NaN losses without warning, the first patch is to add gradient clipping. The second is to log the gradient norm before clipping, so you can see how often clipping is saving you.
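A minimal sketch of the logging idea, assuming a toy linear model and random data as stand-ins for a real training setup. It relies on clip_grad_norm_ returning the total norm measured before clipping, so no separate norm computation is needed.

```python
import torch
from torch.nn.utils import clip_grad_norm_

torch.manual_seed(0)
model = torch.nn.Linear(4, 1)                      # stand-in model (assumption)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
max_norm = 1.0
norms = []                                         # pre-clip norm for each step

for step in range(20):
    xb = torch.randn(8, 4) * 10                    # large inputs to provoke big gradients
    yb = torch.randn(8, 1)
    opt.zero_grad()
    loss = torch.nn.functional.mse_loss(model(xb), yb)
    loss.backward()
    # The return value is the total gradient norm BEFORE clipping was applied.
    total_norm = clip_grad_norm_(model.parameters(), max_norm=max_norm)
    norms.append(total_norm.item())
    opt.step()

clipped_steps = sum(n > max_norm for n in norms)
print(f"clipped {clipped_steps}/{len(norms)} steps, max pre-clip norm {max(norms):.2f}")
```

In a real run the same value would typically go to whatever metrics logger the project already uses, rather than a Python list.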

Norm clipping vs value clipping

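Norm clipping (clip_grad_norm_): preserves direction, scales magnitude. Almost always what you want.

Value clipping (clip_grad_value_): caps each coordinate independently. This distorts the gradient direction, and unless you are doing something specific, such as PPO in RL, it is almost never what you want.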
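The difference shows up on a two-element gradient. The sketch below applies both utilities to the same raw gradient and compares each result with the original using cosine similarity. The helper grad_after and the example values are hypothetical, chosen only to make the distortion visible.

```python
import torch
from torch.nn.utils import clip_grad_norm_, clip_grad_value_

def grad_after(clip_fn, raw):
    """Apply clip_fn to a throwaway parameter that carries `raw` as its gradient."""
    p = torch.nn.Parameter(torch.zeros_like(raw))
    p.grad = raw.clone()
    clip_fn([p])                 # both utilities modify p.grad in place
    return p.grad

raw = torch.tensor([3.0, 0.4])
by_norm = grad_after(lambda ps: clip_grad_norm_(ps, max_norm=1.0), raw)
by_value = grad_after(lambda ps: clip_grad_value_(ps, clip_value=1.0), raw)

cos = torch.nn.functional.cosine_similarity
print("norm  clip:", by_norm.tolist(), "cos =", cos(raw, by_norm, dim=0).item())
print("value clip:", by_value.tolist(), "cos =", cos(raw, by_value, dim=0).item())
```

Norm clipping keeps the cosine similarity at 1 because both coordinates shrink by the same factor. Value clipping only truncates the large coordinate, which rotates the gradient toward the small one.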

Where to put the call

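Between loss.backward() and opt.step(). With mixed precision, between scaler.unscale_(opt) and scaler.step(opt). Clipping operates on the actual gradient values, so the order matters: before unscale_, the gradients are still multiplied by the loss scale, and max_norm would be compared against those inflated values.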
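To see why the order matters, the sketch below mimics loss scaling by hand on the CPU instead of using a real GradScaler. The SCALE constant and the function effective_grad_norm are assumptions for illustration. The true gradient norm is 0.5, so a correct clip with max_norm=1.0 should leave it untouched.

```python
import torch
from torch.nn.utils import clip_grad_norm_

SCALE = 65536.0   # stand-in for the loss scale a GradScaler might use (assumed value)

def effective_grad_norm(clip_before_unscale):
    p = torch.nn.Parameter(torch.tensor([0.3, 0.4]))
    loss = (p ** 2).sum() / 2          # d(loss)/dp = p, so the true gradient norm is 0.5
    (loss * SCALE).backward()          # gradients are now SCALE times too large
    if clip_before_unscale:
        clip_grad_norm_([p], max_norm=1.0)   # wrong order: clips the scaled gradients
        p.grad /= SCALE
    else:
        p.grad /= SCALE                      # conceptually what unscale_ does
        clip_grad_norm_([p], max_norm=1.0)   # right order: clips the true gradients
    return p.grad.norm().item()

print("clip after unscale :", effective_grad_norm(False))
print("clip before unscale:", effective_grad_norm(True))
```

Clipping before unscaling shrinks the gradient to roughly 1/SCALE of the intended bound, which silently stalls training rather than raising an error.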

Principle: Gradient clipping is cheap insurance. Keep it in every training loop by default and remove it only when you have a specific reason. A NaN at step 50,000 costs far more than one extra line of code.

Code

Gradient clipping with mixed precision (Python)

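import torch
from torch.cuda.amp import autocast, GradScaler
from torch.nn.utils import clip_grad_norm_

# model, opt, loss_fn, xb, and yb are assumed to come from the surrounding training loop.
scaler = GradScaler()

opt.zero_grad()
with autocast():                    # run the forward pass in mixed precision
    loss = loss_fn(model(xb), yb)

scaler.scale(loss).backward()       # backward pass on the scaled loss
scaler.unscale_(opt)                # divide grads by the loss scale so max_norm sees true values
clip_grad_norm_(model.parameters(), max_norm=1.0)
scaler.step(opt)                    # skips the optimizer step if grads contain inf/NaN
scaler.update()                     # adjust the loss scale for the next iteration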
