Backpropagation: Chain Rule

Backprop in one paragraph

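Backpropagation is the chain rule applied to a computation graph. For a network L = f₃(f₂(f₁(w₀; x))), the gradient of the loss with respect to a parameter deep in the network is the product of the local Jacobians along the path from that parameter to the loss (summed over paths when there is more than one). Autograd records the graph during the forward pass; the backward pass walks it in reverse, multiplying local Jacobians and accumulating a gradient on each parameter.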
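
To make the "product of local Jacobians" concrete, here is a minimal sketch that computes the gradient for a three-stage network both ways. The specific choices of f1, f2, f3 (a linear map, a tanh, and a sum of squares) and the tensor shapes are assumptions made for illustration; only the L = f₃(f₂(f₁(w₀; x))) structure comes from the text.

```python
import torch
from torch.autograd.functional import jacobian

torch.manual_seed(0)
x = torch.randn(4)
w0 = torch.randn(3, 4, requires_grad=True)

def f1(w):                 # linear stage: (3, 4) weights -> 3 pre-activations
    return w @ x

def f2(h):                 # elementwise nonlinearity: 3 -> 3
    return torch.tanh(h)

def f3(a):                 # scalar loss: 3 -> scalar
    return (a ** 2).sum()

# Forward pass, keeping the intermediate values.
h = f1(w0)
a = f2(h)
loss = f3(a)

# Local Jacobians, each evaluated at its forward-pass input.
J3 = jacobian(f3, a.detach())    # shape (3,)
J2 = jacobian(f2, h.detach())    # shape (3, 3)
J1 = jacobian(f1, w0.detach())   # shape (3, 3, 4)

# Multiply them in backward order: loss -> a -> h -> w0.
grad_h = J3 @ J2                                      # shape (3,)
manual_grad = torch.einsum("i,ijk->jk", grad_h, J1)   # shape (3, 4)

# Autograd does the same walk without materializing full Jacobians.
loss.backward()
jacobian_matches = torch.allclose(manual_grad, w0.grad, atol=1e-5)
print("manual product matches autograd:", jacobian_matches)
```

Building full Jacobians is only practical on toy sizes like this one. Reverse-mode autograd instead propagates a vector-Jacobian product through each operation, which is why the backward pass costs about as much as the forward pass.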

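You almost never write backprop by hand. PyTorch and JAX both implement it as a generic algorithm over a directed acyclic graph of operations, where each operation supplies its own forward computation and its own differentiation rule. The reason to learn backprop in depth anyway: it tells you which architectures will train and which will not.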
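
As a sketch of what "each operation supplies its own forward and backward" looks like in PyTorch, the block below defines a custom operation with torch.autograd.Function. The Square class is a hypothetical example, not something from the lesson; built-in operations follow the same pattern internally.

```python
import torch

class Square(torch.autograd.Function):
    """Hypothetical op y = x * x with a hand-written backward rule."""

    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)      # stash what backward will need
        return x * x

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        # Chain rule: upstream gradient times the local derivative 2x.
        return grad_output * 2 * x

x = torch.tensor(3.0, requires_grad=True)
y = Square.apply(x)
y.backward()
square_grad = x.grad.item()
print(square_grad)  # prints 6.0
```

The generic backward walk never needs to know what Square computes. It only hands each node the upstream gradient and collects what the node returns.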

Chain rule, picture form

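If z = g(y) and y = f(x), then dz/dx = (dz/dy) · (dy/dx). In a deep stack z = g(f(h(...x...))), the gradient with respect to x is the product of every local derivative along the chain. Even when each local derivative is only slightly below or above 1 in magnitude, the product over many layers vanishes or explodes, which is why deep networks need normalization, residual connections, and good initialization.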
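
A quick arithmetic check of that claim, as a sketch: the depth of 50 and the per-layer factors 0.9 and 1.1 are illustrative assumptions. The sigmoid case uses the fact that its derivative never exceeds 0.25.

```python
import math

def sigmoid_grad(z):
    """Derivative of the logistic sigmoid at z; its maximum is 0.25 at z = 0."""
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)

depth = 50

# Best case for a pure sigmoid chain (ignoring weights): every factor is 0.25.
best_case = sigmoid_grad(0.0) ** depth   # roughly 7.9e-31

# Factors only slightly away from 1 still compound over depth.
shrink = 0.9 ** depth                    # roughly 5.2e-3
grow = 1.1 ** depth                      # roughly 117

print(f"0.25^{depth} = {best_case:.3e}")
print(f"0.9^{depth}  = {shrink:.3e}")
print(f"1.1^{depth}  = {grow:.3e}")
```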

Tip: every modern stability trick (He init, BatchNorm, residual connections, gradient clipping) exists to keep that long chain product well-conditioned. These are not optional decoration; they are what makes backprop work at depth.
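
Of the tricks listed, gradient clipping is the easiest to show in isolation. The sketch below assumes a toy nn.Linear model and an artificially scaled loss so that the gradient is large; the threshold max_norm = 1.0 is an arbitrary choice for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(4, 1)
x = torch.randn(8, 4)

# Scale the output by 1000 so the gradient is artificially large.
loss = (model(x) * 1000.0).pow(2).mean()
loss.backward()

max_norm = 1.0
# Rescales all gradients in place so their combined L2 norm is at most
# max_norm, and returns the norm measured before clipping.
norm_before = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
norm_after = torch.sqrt(sum(p.grad.pow(2).sum() for p in model.parameters()))

print(f"grad norm before clipping: {norm_before.item():.3e}")
print(f"grad norm after clipping:  {norm_after.item():.3e}")
```

Clipping only bounds the size of an exploding gradient; it does nothing for a vanishing one, which is where initialization, normalization, and residual connections come in.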

What autograd gives you for free

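Define the forward pass. Set requires_grad=True on the leaves you want gradients for. Call .backward() on a scalar loss. Read .grad on every leaf. That is the entire user interface. Under the hood, PyTorch built the graph, walked it in reverse, applied the chain rule, and accumulated the results.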
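
Those four steps, as a minimal sketch on made-up tensors (the values of w and x are assumptions chosen so the gradient is easy to check by eye). It also shows what "accumulated" means in practice: a second .backward() adds to .grad rather than overwriting it.

```python
import torch

w = torch.tensor([1.0, 2.0], requires_grad=True)  # leaf we want a gradient for
x = torch.tensor([3.0, 4.0])                      # plain data, no gradient tracked

loss = (w * x).sum()          # forward pass records the graph
loss.backward()               # backward pass fills w.grad
first_grad = w.grad.clone()   # d(loss)/dw = x = [3, 4]

# A second backward pass adds to the existing .grad.
loss = (w * x).sum()
loss.backward()
accumulated_grad = w.grad.clone()   # [6, 8]

# Reset before the next step; optimizers expose this as zero_grad().
w.grad.zero_()

print(first_grad, accumulated_grad, w.grad)
```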

Principle: backprop is fragile because it multiplies a long chain of small numbers. The architecture decisions in the next track (normalization, residual connections, careful init) are all about keeping that product alive at depth.
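
A residual connection shows the principle in its simplest form: the local derivative of h + f(h) is 1 + f'(h), so the chain product no longer has to shrink. The sketch below uses a hypothetical scalar "layer" tanh(0.5 * h) and a depth of 50, both chosen only for illustration.

```python
import torch

def input_grad(depth, residual):
    """Gradient of the final activation with respect to the input x = 1.0."""
    x = torch.tensor(1.0, requires_grad=True)
    h = x
    for _ in range(depth):
        branch = torch.tanh(0.5 * h)            # local derivative is at most 0.5
        h = h + branch if residual else branch  # the skip path adds 1 to that derivative
    h.backward()
    return x.grad.item()

plain_grad = input_grad(50, residual=False)
residual_grad = input_grad(50, residual=True)

print(f"plain stack:    {plain_grad:.3e}")
print(f"residual stack: {residual_grad:.3e}")
```

In the plain stack every factor is below 0.5, so the product is smaller than 0.5^50. In the residual stack every factor is at least 1, so the gradient cannot vanish along the skip path.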

Code

Backprop by hand vs autograd, on a tiny graph·python

import torch

x = torch.tensor(3.0, requires_grad=True)  # leaf: we want dL/dx
a = x ** 2          # da/dx = 2x
b = torch.exp(x)    # db/dx = e^x
c = a + b           # dc/dx = 2x + e^x
L = c * 2           # dL/dc = 2

# By hand: dL/dx = 2 * (2x + e^x) = 2*(6 + e^3)
manual = 2 * (2 * 3 + torch.exp(torch.tensor(3.0))).item()

L.backward()
auto = x.grad.item()

print(f"manual: {manual:.4f}")
print(f"autograd: {auto:.4f}")

Vanishing gradient diagnostic·python

import torch, torch.nn as nn

class DeepSig(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.Sequential(*[
            nn.Sequential(nn.Linear(64, 64), nn.Sigmoid())
            for _ in range(50)
        ])
        self.head = nn.Linear(64, 1)
    def forward(self, x):
        return self.head(self.layers(x))

m = DeepSig()
x = torch.randn(8, 64)
y = m(x).sum()
y.backward()

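# Compare weight-gradient norms at the two ends of the stack.
first = m.layers[0][0].weight.grad.norm().item()
last  = m.layers[-1][0].weight.grad.norm().item()
# Scientific notation: the first-layer norm is far too small for fixed-point.
print(f"first-layer grad norm: {first:.3e}")
print(f"last-layer grad norm:  {last:.3e}")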
