Gradient intuition

A gradient is a direction, not a number

For a scalar function L of a vector w, the gradient ∇L(w) is a vector that points in the direction of steepest increase of L. Its magnitude tells you how steep the slope is in that direction. To minimize L, step along -∇L.
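To make the "steepest increase" claim concrete, here is a minimal sketch in plain Python. The function `L`, the helper `numerical_grad`, the point `w`, and the step size are all illustrative choices, not part of this lesson's own code. It estimates the gradient with central differences and checks that a small step along -∇L lowers L while the same step along +∇L raises it.

```python
def L(w):
    # Illustrative scalar function of a 2-D vector: L(w) = w0^2 + 3*w1^2
    return w[0] ** 2 + 3.0 * w[1] ** 2

def numerical_grad(f, w, eps=1e-6):
    # Central-difference estimate of each partial derivative
    grad = []
    for i in range(len(w)):
        up = list(w)
        up[i] += eps
        down = list(w)
        down[i] -= eps
        grad.append((f(up) - f(down)) / (2 * eps))
    return grad

w = [1.0, -2.0]
g = numerical_grad(L, w)   # analytic gradient here is [2*w0, 6*w1]
step = 0.01

w_down = [wi - step * gi for wi, gi in zip(w, g)]   # move along -grad
w_up = [wi + step * gi for wi, gi in zip(w, g)]     # move along +grad

print("grad:", g)
print("L(w):", L(w))
print("after -grad step:", L(w_down))
print("after +grad step:", L(w_up))
```

The gradient's second component is larger in magnitude because L is steeper along w1, which is what "magnitude tells you how steep" means in practice.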

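In a neural network, w is the concatenation of all parameters (millions to billions of numbers) and L is the loss on a batch. The gradient has the same dimension as w, and each entry tells you how much the loss would change if you nudged that one parameter slightly.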
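A minimal sketch of the "same dimension as w" point, assuming a tiny `torch.nn.Linear` model and random data picked purely for illustration (neither comes from this lesson). After `backward()`, each parameter's `.grad` has that parameter's shape, and a single gradient entry predicts the loss change from nudging that single parameter.

```python
import torch

torch.manual_seed(0)
model = torch.nn.Linear(3, 1)     # hypothetical tiny model: 3 weights + 1 bias
x = torch.randn(8, 3)             # illustrative batch of 8 examples
y = torch.randn(8, 1)

def batch_loss():
    # Mean squared error on the illustrative batch
    return torch.nn.functional.mse_loss(model(x), y)

loss = batch_loss()
loss.backward()

for name, p in model.named_parameters():
    # Each gradient tensor has the same shape as its parameter
    print(name, tuple(p.shape), tuple(p.grad.shape))

# Nudge one weight by a small eps and compare with the gradient's prediction
eps = 1e-3
g = model.weight.grad[0, 0].item()
with torch.no_grad():
    model.weight[0, 0] += eps
actual_change = batch_loss().item() - loss.item()
print("predicted change:", g * eps)
print("actual change:   ", actual_change)
```

The two printed changes should agree closely but not exactly, since the gradient is only a first-order (linear) prediction.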

Principle: treat the gradient as a vector field over parameter space. Training is a particle moving through that field along the negative gradient, and the optimizer adds inertia and rescaling.
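As a rough sketch of what "inertia" and "rescaling" can look like, the plain-Python code below uses two common instances: momentum (inertia) and an RMSProp-style update (per-coordinate rescaling). The toy loss, the function names, and every hyperparameter are assumptions made for illustration; the lesson itself does not name a specific optimizer.

```python
def loss(w):
    # Illustrative ill-conditioned loss: much steeper along w1 than w0
    return w[0] ** 2 + 25.0 * w[1] ** 2

def grad(w):
    return [2.0 * w[0], 50.0 * w[1]]

def run_momentum(w, lr=0.01, beta=0.9, steps=200):
    # Inertia: the velocity v accumulates past gradients
    v = [0.0, 0.0]
    for _ in range(steps):
        g = grad(w)
        v = [beta * vi + gi for vi, gi in zip(v, g)]
        w = [wi - lr * vi for wi, vi in zip(w, v)]
    return w

def run_rmsprop(w, lr=0.01, decay=0.9, eps=1e-8, steps=1):
    # Rescaling: divide each coordinate by a running RMS of its own gradient
    s = [0.0, 0.0]
    for _ in range(steps):
        g = grad(w)
        s = [decay * si + (1 - decay) * gi * gi for si, gi in zip(s, g)]
        w = [wi - lr * gi / (si ** 0.5 + eps) for wi, gi, si in zip(w, g, s)]
    return w

start = [2.0, 1.0]
w_mom = run_momentum(start)
w_rms = run_rmsprop(start, steps=1)

print("momentum, 200 steps:", w_mom, "loss:", loss(w_mom))
print("rescaled, 1 step moved by:", [s - w for s, w in zip(start, w_rms)])
```

In the rescaled update the two coordinates move by the same amount on the first step even though their raw gradients differ by more than a factor of ten, which is the sense in which the optimizer reshapes the field rather than just following it.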

What the 'tangent line' intuition gets right and wrong

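The 1-D picture (gradient = slope = first derivative) generalizes well. The high-dimensional picture differs in one important way: the loss surface has a huge number of directions whose steepness can vary widely, and the gradient is the vector whose components are the slopes along each coordinate direction. In very high dimensions, almost every point where the gradient vanishes is a saddle point rather than a true minimum: it curves up in some directions and down in others. Part of the optimizer's job is to escape from those points.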
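A toy saddle makes this concrete. The sketch below uses the textbook function f(x, y) = x² - y², chosen here only for illustration and not as a model of any real network. The gradient is zero at the origin, yet the origin is not a minimum: plain gradient descent started almost exactly on the x-axis first slides toward the saddle and then escapes along y.

```python
def f(x, y):
    # Curves up along x and down along y: a saddle at the origin
    return x ** 2 - y ** 2

def grad_f(x, y):
    return 2.0 * x, -2.0 * y

print("gradient at origin:", grad_f(0.0, 0.0))

# Start nearly on the x-axis, with a tiny offset in y
x, y = 1.0, 1e-6
lr = 0.1
for _ in range(50):
    gx, gy = grad_f(x, y)
    x, y = x - lr * gx, y - lr * gy

print("x:", x, "y:", y, "f:", f(x, y))
```

The tiny initial offset in y grows at every step while x shrinks, so the iterate ends up below the saddle's height of f = 0. In training, noise from mini-batches tends to supply that kind of small offset.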

Why automatic differentiation matters

Computing gradients by hand for a 100-million-parameter network is impossible. Autograd (PyTorch's automatic differentiation engine) builds a computation graph on the fly during the forward pass and applies the chain rule mechanically in the backward pass. It is the single piece of technology that made deep learning practical at scale.
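The following sketch shows that mechanism on a three-operation function, assuming an illustrative input value and variable names of my own choosing. It inspects the graph that the forward pass recorded, then compares autograd's result with the chain rule written out by hand.

```python
import torch

x = torch.tensor(1.5, requires_grad=True)

# Forward pass: each operation is recorded in the graph as it runs
a = x * x            # a = x^2
b = torch.sin(a)     # b = sin(a)
y = 3.0 * b          # y = 3 * b

print(y.grad_fn)                  # graph node for the last operation
print(y.grad_fn.next_functions)   # links back toward the earlier operations

# Backward pass: autograd walks the graph and applies the chain rule
y.backward()

# Chain rule by hand: dy/dx = dy/db * db/da * da/dx = 3 * cos(x^2) * 2x
x0 = x.detach()
manual = 3.0 * torch.cos(x0 ** 2) * 2.0 * x0

print("autograd:", x.grad.item())
print("manual:  ", manual.item())
```

For three operations the hand derivation is easy; the point is that autograd does the same bookkeeping unchanged when the graph has millions of nodes.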

Pippa's note: my existence depends on autograd in a literal sense. The fine-tuned model behind every Pippa shell was trained with gradients that flowed through a million operations no engineer wrote by hand.

Code

Inspect a gradient with PyTorch autograd (Python)

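import torch

# Three parameters; requires_grad=True tells autograd to track operations on w
w = torch.tensor([2.0, -3.0, 1.0], requires_grad=True)
# Scalar loss: sum of squares plus two linear terms
loss = (w**2).sum() + 5 * w[1] - w[2]

# Backward pass: fills w.grad with dloss/dw = 2*w + [0, 5, -1]
loss.backward()
print("loss:", loss.item())
print("grad:", w.grad)            # tensor([4., -1., 1.])

# One gradient-descent step along -grad; no_grad keeps the update out of the graph
lr = 0.1
with torch.no_grad():
    w_new = w - lr * w.grad
print("step:", w_new)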
