Vanishing and Exploding Gradients

Two failure modes with the same root cause

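The chain rule multiplies derivatives. If most of those derivatives are smaller than 1, the product shrinks exponentially toward 0 as depth grows: the vanishing gradient. If most are larger than 1, the product grows exponentially instead: the exploding gradient. Either way, a deep network stops learning.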
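As a rough numerical illustration of that product, the sketch below multiplies one constant per-layer derivative across 20 layers. The factors 0.25 and 1.5 and the depth of 20 are illustrative assumptions, not measurements from a real network.

```python
# Sketch: how a product of per-layer derivatives scales with depth.
# The factors 0.25 and 1.5 and the depth of 20 are illustrative assumptions.

def chain_product(per_layer_derivative: float, depth: int) -> float:
    """Multiply the same derivative `depth` times, as the chain rule would."""
    product = 1.0
    for _ in range(depth):
        product *= per_layer_derivative
    return product

vanishing = chain_product(0.25, 20)  # every factor below 1
exploding = chain_product(1.5, 20)   # every factor above 1

print(f"0.25^20 = {vanishing:.3e}")
print(f"1.5^20  = {exploding:.3e}")
```

The first product is on the order of 1e-12 and the second is in the thousands, so a modest per-layer bias in either direction compounds quickly.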

Vanishing gradients were the dominant problem in the late 1990s. Sigmoid and tanh activations saturate in their tails (derivative ≈ 0), so the early layers of a 20-layer sigmoid net barely train. ReLU, careful initialization, and batch normalization fix most of it; residual connections fix the rest.
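To make the saturation concrete, here is a small sketch that evaluates the derivatives of sigmoid, tanh, and ReLU at a few inputs. The helper names and the sample points are chosen for illustration only.

```python
import math

# Sketch: activation derivatives at a few sample inputs.
# Helper names and sample points are illustrative assumptions.

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x: float) -> float:
    s = sigmoid(x)
    return s * (1.0 - s)  # peaks at 0.25 when x = 0

def tanh_grad(x: float) -> float:
    return 1.0 - math.tanh(x) ** 2  # peaks at 1.0 when x = 0

def relu_grad(x: float) -> float:
    return 1.0 if x > 0 else 0.0  # constant 1 for any positive input

for x in (0.0, 2.0, 5.0, 10.0):
    print(
        f"x={x:5.1f}  sigmoid'={sigmoid_grad(x):.6f}  "
        f"tanh'={tanh_grad(x):.6f}  relu'={relu_grad(x):.1f}"
    )
```

The sigmoid derivative never exceeds 0.25, so even without saturation each sigmoid layer contributes a factor well below 1 to the chain, before the weights are taken into account. ReLU keeps a derivative of exactly 1 on the positive side, which is part of why it helps.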

Exploding gradients are usually fixed with one knob

Exploding gradients show up most often in recurrent networks (LSTM, GRU) and in transformers with poor learning-rate scheduling. The fix is gradient clipping: before the optimizer step, cap the global norm of the gradients at some threshold. torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0) does it in one line.
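For intuition about what "cap the global norm" means, here is a framework-free sketch of the idea. It is not PyTorch's implementation: the function name clip_by_global_norm, the list-of-lists gradient layout, and the eps value are assumptions made for this example.

```python
import math

# Sketch of global-norm clipping on plain Python lists.
# clip_by_global_norm is a hypothetical helper, not a PyTorch function.

def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Scale every gradient by one shared factor so the combined L2 norm is at most max_norm.

    `grads` holds one inner list of floats per parameter tensor.
    Returns (clipped_grads, total_norm_before_clipping).
    """
    total_norm = math.sqrt(sum(g * g for tensor in grads for g in tensor))
    # Never scale up: gradients already under the threshold pass through unchanged
    scale = min(1.0, max_norm / (total_norm + eps))
    clipped = [[g * scale for g in tensor] for tensor in grads]
    return clipped, total_norm

grads = [[3.0, 4.0], [12.0]]  # global norm = sqrt(9 + 16 + 144) = 13
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = math.sqrt(sum(g * g for tensor in clipped for g in tensor))
print(f"before={norm_before:.3f}  after={norm_after:.3f}")
```

Because every gradient is multiplied by the same factor, the update direction is preserved and only its length changes. That is the difference from clipping each element independently.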

Tip: If training starts producing NaN partway through with no warning, exploding gradients are a likely cause. Add gradient clipping (max_norm=1.0 is a sensible default) and see whether the symptom goes away.

Vanishing gradients need an architectural answer

You can't clip a vanishing gradient: there is nothing to clip, because the gradient is already close to 0. The fix is structural: ReLU/GELU activations, He/Xavier initialization, batch/layer normalization, and residual connections. Modern transformers use all four together by default, which is why they train at depths that would have been impossible in 2014.
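One way to see why residual connections help is a scalar sketch: a plain layer y = f(x) contributes a factor f'(x) to the chain, while a residual layer y = x + f(x) contributes 1 + f'(x). The local derivative 0.01 and the depth of 20 are illustrative assumptions, and real networks multiply Jacobian matrices rather than scalars.

```python
# Scalar sketch: gradient factor through plain versus residual layers.
# The local derivative 0.01 and the depth of 20 are illustrative assumptions.

def plain_chain(local_derivative: float, depth: int) -> float:
    """Gradient factor through `depth` layers of y = f(x)."""
    product = 1.0
    for _ in range(depth):
        product *= local_derivative
    return product

def residual_chain(local_derivative: float, depth: int) -> float:
    """Gradient factor through `depth` layers of y = x + f(x), whose derivative is 1 + f'(x)."""
    product = 1.0
    for _ in range(depth):
        product *= 1.0 + local_derivative
    return product

plain = plain_chain(0.01, 20)
residual = residual_chain(0.01, 20)
print(f"plain:    {plain:.3e}")
print(f"residual: {residual:.3f}")
```

The plain chain collapses to a number indistinguishable from zero, while the residual chain stays close to 1, because the identity path carries the gradient even when f'(x) is tiny.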

Principle: Deep training is mostly the art of keeping the gradient chain alive. The next tracks (regularization, normalization, residual) are exactly that art.

Code

Gradient clipping in three lines·python

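import torch
from torch.nn.utils import clip_grad_norm_

# Placeholder model, optimizer, and batch so the snippet runs on its own
model = torch.nn.Linear(10, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss = model(torch.randn(32, 10)).pow(2).mean()

opt.zero_grad()                                     # clear gradients from the previous step
loss.backward()                                     # compute fresh gradients
clip_grad_norm_(model.parameters(), max_norm=1.0)   # rescale so the global norm is at most 1.0
opt.step()                                          # update with the clipped gradients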

Per-layer grad norms diagnostic·python

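def layer_grad_norms(model):
    """Return (parameter name, gradient L2 norm) pairs; call after loss.backward()."""
    norms = []
    for name, p in model.named_parameters():
        # Parameters that took no part in the backward pass have no gradient
        if p.grad is not None:
            norms.append((name, p.grad.norm().item()))
    return norms

# Print the first five entries; norms near 0 in early layers suggest vanishing gradients
for name, n in layer_grad_norms(model)[:5]:
    print(f"{name:40} {n:.6f}")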
