One Train Step

The 5-Line Ritual

The unit of training is one step. It is always the same 5 lines, in the same order:

opt.zero_grad(): clear the gradients left over from the previous step.
logits = model(xb): forward pass.
loss = loss_fn(logits, yb): scalar loss.
loss.backward(): compute gradients with autograd.
opt.step(): apply the gradients to the parameters.

Optionally, slot clip_grad_norm_ between backward and step, and call scheduler.step() after the step. That is the entire loop body. Everything else is plumbing around it.
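To see the ritual run end to end, here is a minimal sketch on a toy problem. The data, the nn.Linear model, the SGD learning rate, and the StepLR schedule are all illustrative assumptions and not part of the lesson; only the order of the calls inside train_step is the point.

```python
import torch
from torch import nn
from torch.nn.utils import clip_grad_norm_

torch.manual_seed(0)

# Hypothetical toy setup: 64 samples, 10 features, 3 classes.
xb = torch.randn(64, 10)
yb = torch.randint(0, 3, (64,))

model = nn.Linear(10, 3)
loss_fn = nn.CrossEntropyLoss()
opt = torch.optim.SGD(model.parameters(), lr=0.1)
scheduler = torch.optim.lr_scheduler.StepLR(opt, step_size=10, gamma=0.5)

def train_step(xb, yb):
    opt.zero_grad()                                    # 1. clear old gradients
    logits = model(xb)                                 # 2. forward pass
    loss = loss_fn(logits, yb)                         # 3. scalar loss
    loss.backward()                                    # 4. compute gradients
    clip_grad_norm_(model.parameters(), max_norm=1.0)  # optional
    opt.step()                                         # 5. update parameters
    scheduler.step()                                   # optional
    return loss.item()

# Repeat the same step on the same batch; the loss should go down.
losses = [train_step(xb, yb) for _ in range(20)]
print(f"first loss {losses[0]:.4f}, last loss {losses[-1]:.4f}")
```

Wrapping the lines in a function is just a convenience for calling the step repeatedly; in a real loop the same lines usually sit directly in the body of the for loop over the loader.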

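Tip: if you can't write the 5-line ritual on a whiteboard from memory, you don't have the muscle memory yet. Write it out 20 times. Eventually you'll be typing opt.zero_grad() before you finish the sentence.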

Two Famous Step 1 Bugs

Forgetting zero_grad(): gradients accumulate across steps, so each step sees the sum of every gradient computed so far. The loss stays high and training can collapse. The fix is one line.
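The accumulation is easy to observe directly. The sketch below uses a hypothetical toy model and toy data, chosen only for illustration. It calls backward twice on the same batch without clearing in between, then once more after clearing.

```python
import torch
from torch import nn

torch.manual_seed(0)
model = nn.Linear(4, 1)                      # toy model, for illustration only
loss_fn = nn.MSELoss()
xb, yb = torch.randn(8, 4), torch.randn(8, 1)

loss_fn(model(xb), yb).backward()
g1 = model.weight.grad.clone()               # gradient of one batch

# No zero_grad() here: the second backward adds to what is already in .grad.
loss_fn(model(xb), yb).backward()
g2 = model.weight.grad.clone()

model.zero_grad()                            # clear, then recompute
loss_fn(model(xb), yb).backward()
g3 = model.weight.grad.clone()

print(torch.allclose(g2, 2 * g1))            # accumulated gradient is doubled
print(torch.allclose(g3, g1))                # after clearing, back to one batch
```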

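Calling opt.step() before loss.backward(): the optimizer steps on whatever is already sitting in .grad, not on the current batch's gradient. If zero_grad() runs after the step, that is the stale gradient from the previous batch, and the loss curve can look vaguely like training even though the model is never updated with the current batch. If zero_grad() has just run, as in the ritual's ordering, .grad is empty and the step typically has nothing to apply.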
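A minimal sketch of the swapped order, with a hypothetical toy model and plain SGD (no momentum, no weight decay). It assumes a recent PyTorch in which zero_grad() resets .grad to None by default, so the optimizer skips every parameter. The loop runs without error, yet the weights never move.

```python
import torch
from torch import nn

torch.manual_seed(0)
model = nn.Linear(4, 1)                      # toy model, for illustration only
loss_fn = nn.MSELoss()
opt = torch.optim.SGD(model.parameters(), lr=0.1)
xb, yb = torch.randn(8, 4), torch.randn(8, 1)

w0 = model.weight.detach().clone()           # snapshot of the initial weights
for _ in range(3):
    opt.zero_grad()
    loss = loss_fn(model(xb), yb)
    opt.step()        # BUG: runs before backward, so .grad was just cleared
    loss.backward()   # gradient is computed, then discarded by the next zero_grad()

unchanged = torch.equal(model.weight.detach(), w0)
print(unchanged)      # the parameters were never updated
```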

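Gradient accumulation (the right way to skip `zero_grad`)

Sometimes you want an effective batch size larger than what fits in VRAM. The trick: skip zero_grad() for several batches, divide each loss by the number of accumulation steps, and at the end call opt.step() followed by opt.zero_grad(). For a mean-reduced loss and equal-sized micro-batches, this produces a gradient equivalent to that of a single batch accum_steps times larger.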
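The equivalence can be checked numerically. The sketch below uses a hypothetical toy model and random data. It compares the gradient from one backward pass over 32 samples with the gradient accumulated over 4 micro-batches of 8, each loss divided by 4. It relies on the loss using mean reduction (the default for nn.CrossEntropyLoss) and on the model having no batch-dependent layers such as BatchNorm.

```python
import torch
from torch import nn

torch.manual_seed(0)
model = nn.Linear(10, 3)                     # toy model, for illustration only
loss_fn = nn.CrossEntropyLoss()              # mean reduction by default
X = torch.randn(32, 10)
Y = torch.randint(0, 3, (32,))
ACCUM = 4

# Reference: one backward pass over the full batch of 32.
model.zero_grad()
loss_fn(model(X), Y).backward()
full_grad = model.weight.grad.clone()

# Accumulated: 4 micro-batches of 8, each loss divided by ACCUM.
model.zero_grad()
for xb, yb in zip(X.chunk(ACCUM), Y.chunk(ACCUM)):
    (loss_fn(model(xb), yb) / ACCUM).backward()
accum_grad = model.weight.grad.clone()

grads_match = torch.allclose(full_grad, accum_grad, atol=1e-6)
print(grads_match)
```

Without the division by ACCUM the accumulated gradient would be ACCUM times too large, which acts like a silently multiplied learning rate.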

Principle: the 5-line ritual is the smallest unit of training. Memorize it. Then learn one variation at a time (gradient accumulation, mixed precision, gradient clipping). Don't skip the basic version.

Code

One train step, with all the optional pieces·python

from torch.nn.utils import clip_grad_norm_

opt.zero_grad()
logits = model(xb)
loss = loss_fn(logits, yb)
loss.backward()
clip_grad_norm_(model.parameters(), max_norm=1.0)   # optional
opt.step()
scheduler.step()                                    # optional

Gradient accumulation for big effective batch·python

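from torch.nn.utils import clip_grad_norm_

ACCUM = 4                                      # micro-batches per optimizer step
opt.zero_grad()
for micro_idx, (xb, yb) in enumerate(loader):
    logits = model(xb)
    loss = loss_fn(logits, yb) / ACCUM        # rescale for averaging
    loss.backward()                            # accumulates into .grad
    if (micro_idx + 1) % ACCUM == 0:           # every ACCUM-th micro-batch
        clip_grad_norm_(model.parameters(), max_norm=1.0)
        opt.step()
        opt.zero_grad()
# Note: if len(loader) is not a multiple of ACCUM, the trailing micro-batches are never stepped.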
