Learning Rate Schedule

Learning rate 가 가장 중요한 hyperparameter

Learning rate 가 너무 작으면 training crawl. 너무 크면 training diverge. 올바른 값은 optimizer, model, batch size, dataset, training stage 에 따라 달라. Universal rule 은 없지만, 작동하는 패턴들은 있어.

대부분의 modern training 에 standard recipe 는 warmup + cosine decay: 첫 수백~수천 step 동안 0 에서 lr_max 까지 linearly ramp, 그 다음 나머지 training 동안 ~0 으로 부드럽게 decay. 이 조합이 unstable early training 과 diminishing-returns late training 둘 다 깔끔히 다룸.

팁: Warmup 없이 transformer train 하면, 첫 백 step 에서 loss 가 explode 하거나 stall 하는 거 봐. Warmup 은 어떤 size 이상의 attention-based model 에 non-optional.

Toolbox 의 흔한 schedule

Constant — 짧은 training 과 prototyping 에 OK, final 선택은 거의 안 됨.
Step decay — N epoch 마다 lr 에 0.1 곱. ResNet-style training 의 classic, 요즘 덜 흔함.
Cosine annealing — cosine 의 후반부 모양의 부드러운 decay. 거의 어디서나 작동.
Cosine with warm restart — 주기적 refresh 원하는 매우 긴 training run 에 유용.
OneCycle — Smith 의 recipe: ramp up, ramp down, peak 가운데. 짧은 fine-tuning run 에 강함.

좋은 `lr_max` 찾는 법

'LR finder' trick (Smith, 2017): 1e-7 에서 10 까지 exponentially 증가하는 lr 로 수백 step train. Loss vs lr 그려. Loss 가 가장 빠르게 descending 하는 lr 값 골라, 2-4 로 나눠. 보통 blind hyperparameter search 보다 이김.

원칙: Training duration 기반으로 schedule 먼저 골라, LR finder 로 lr_max 골라, 그 다음 둘 다 그대로 두고 다른 모든 거 tune. 중간에 schedule 조정은 confusion recipe.

Code

Warmup + cosine schedule·python

import torch, math
from torch.optim.lr_scheduler import LambdaLR

def warmup_cosine(optimizer, warmup_steps, total_steps, min_ratio=0.1):
    def fn(step):
        if step < warmup_steps:
            return step / max(1, warmup_steps)
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        cosine = 0.5 * (1 + math.cos(math.pi * progress))
        return min_ratio + (1 - min_ratio) * cosine
    return LambdaLR(optimizer, fn)

opt = torch.optim.AdamW(model.parameters(), lr=3e-4)
sch = warmup_cosine(opt, warmup_steps=500, total_steps=20_000)

for step in range(20_000):
    opt.zero_grad()
    loss = compute_loss(model, batch())
    loss.backward(); opt.step(); sch.step()

Learning rate 가 가장 중요한 hyperparameter

Toolbox 의 흔한 schedule

좋은 `lr_max` 찾는 법

Code

External links

Exercise

Progress

댓글 0

Learning rate 가 가장 중요한 hyperparameter

Toolbox 의 흔한 schedule

좋은 lr_max 찾는 법

Code

External links

Exercise

Progress

좋은 `lr_max` 찾는 법