Learning Rate Schedulers and Warmup

A constant LR is rarely optimal

Learning rate decay (start high, end low) is one of the simplest free wins in deep learning. PyTorch ships many schedulers; three see regular use:

StepLR: multiplies the LR by gamma every step_size epochs. Simple, predictable, and the historical default.
CosineAnnealingLR: follows a smooth cosine curve from the initial LR down to eta_min over T_max steps. A favorite for modern Transformer and vision training.
OneCycleLR: warmup followed by cooldown within a single cycle. Works well for fast training (Leslie Smith's 1cycle policy).
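
As a rough sketch of what the first two schedulers compute, their curves can be written in plain Python. The function names `step_lr` and `cosine_lr` are illustrative and not part of PyTorch; the formulas follow the closed forms documented for StepLR and CosineAnnealingLR, and the default values mirror the ones used in the Code section.

```python
import math

def step_lr(epoch, base_lr=1e-3, step_size=30, gamma=0.1):
    """LR after `epoch` scheduler steps under a StepLR-style schedule."""
    return base_lr * gamma ** (epoch // step_size)

def cosine_lr(t, base_lr=1e-3, eta_min=1e-6, t_max=100):
    """LR after `t` scheduler steps under cosine annealing (0 <= t <= t_max)."""
    return eta_min + 0.5 * (base_lr - eta_min) * (1.0 + math.cos(math.pi * t / t_max))

# Compare the two curves at a few epochs.
for epoch in (0, 29, 30, 60, 100):
    print(epoch, step_lr(epoch), cosine_lr(epoch))
```

StepLR holds the LR flat and then drops it abruptly at each multiple of step_size, while the cosine curve changes a little on every step.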

Warmup: why and how

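Modern Transformer training almost always begins with a few hundred steps of warmup: the LR rises linearly from 0 to the target LR and then decays. Without warmup, the early gradients are huge (parameters are random, the loss is high, and the derivatives are erratic), so the optimizer takes catastrophically large steps. With warmup, training eases into a regime where the loss landscape is well-behaved.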

To combine warmup with cosine decay, either chain two schedulers with SequentialLR or write a custom LambdaLR. If you would rather not write it yourself, HuggingFace's get_cosine_schedule_with_warmup is the canonical implementation.
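
A minimal sketch of the SequentialLR route, assuming a PyTorch version that provides LinearLR and SequentialLR (1.10 or later). The step counts and `start_factor` are illustrative values, not recommendations. LinearLR requires a start factor above zero, so this warmup begins at 1% of the target LR rather than at exactly 0.

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LinearLR, CosineAnnealingLR, SequentialLR

# Illustrative sizes, kept small so the loop below finishes quickly.
warmup_steps = 100
total_steps = 1_000
peak_lr = 1e-3

# A single dummy parameter stands in for model.parameters().
opt = AdamW([torch.zeros(1, requires_grad=True)], lr=peak_lr)

# Phase 1: linear ramp from 1% of peak_lr up to peak_lr.
warmup = LinearLR(opt, start_factor=0.01, end_factor=1.0, total_iters=warmup_steps)
# Phase 2: cosine decay over the remaining steps.
decay = CosineAnnealingLR(opt, T_max=total_steps - warmup_steps, eta_min=1e-6)
# Switch from phase 1 to phase 2 once warmup_steps scheduler steps have elapsed.
scheduler = SequentialLR(opt, schedulers=[warmup, decay], milestones=[warmup_steps])

lrs = []
for step in range(total_steps):
    lrs.append(opt.param_groups[0]["lr"])  # LR in effect for this step
    opt.step()        # a real loop would run forward/backward before this
    scheduler.step()  # per-step schedule: one call per batch

print(f"start={lrs[0]:.2e} peak={max(lrs):.2e} end={lrs[-1]:.2e}")
```

Recording `opt.param_groups[0]["lr"]` like this is a cheap way to check the shape of a composed schedule before committing to a long run.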

Step at the right granularity

Per-epoch schedulers (StepLR, CosineAnnealingLR counted in epochs): call scheduler.step() once per epoch.
Per-step schedulers (OneCycleLR, LambdaLR counted in steps): call scheduler.step() once per batch.
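
To see why the granularity matters, consider a cosine schedule configured with T_max=100 (meant as epochs) but stepped once per batch. The sketch below uses the closed-form cosine formula in plain Python; `steps_per_epoch = 500` and the helper name `cosine_lr` are made up for illustration, and the helper clamps at `t_max` for simplicity.

```python
import math

def cosine_lr(t, t_max, base_lr=1e-3, eta_min=1e-6):
    """Closed-form cosine annealing after t scheduler steps (clamped at t_max here)."""
    t = min(t, t_max)
    return eta_min + 0.5 * (base_lr - eta_min) * (1.0 + math.cos(math.pi * t / t_max))

num_epochs = 100
steps_per_epoch = 500          # hypothetical len(train_loader)

# Mismatch: T_max counts epochs, but step() is called once per batch.
lr_wrong = cosine_lr(t=steps_per_epoch, t_max=num_epochs)

# Consistent: T_max counts batches, and step() is called once per batch.
lr_right = cosine_lr(t=steps_per_epoch, t_max=num_epochs * steps_per_epoch)

print(f"after 1 epoch, mismatched T_max: {lr_wrong:.2e}")
print(f"after 1 epoch, matching T_max:   {lr_right:.2e}")
```

In the mismatched case the schedule is already at eta_min after 100 batches, a fifth of the first epoch. Whatever unit T_max (or total_steps) is expressed in, scheduler.step() has to be called once per that unit.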

Code

StepLR and CosineAnnealingLR·python

import torch
import torch.optim as optim
from torch.optim.lr_scheduler import StepLR, CosineAnnealingLR

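num_epochs = 100
params = [torch.zeros(1, requires_grad=True)]   # stand-in for model.parameters()

# Attach one scheduler per optimizer; two optimizers here only to show both schedulers.
step_opt = optim.AdamW(params, lr=1e-3)
cos_opt = optim.AdamW(params, lr=1e-3)

# StepLR: 1e-3 → (epoch 30) → 1e-4 → (epoch 60) → 1e-5
step_sched = StepLR(step_opt, step_size=30, gamma=0.1)

# CosineAnnealingLR: smooth decay from 1e-3 to 1e-6 over num_epochs epochs
cos_sched = CosineAnnealingLR(cos_opt, T_max=num_epochs, eta_min=1e-6)

# Per-EPOCH schedulers: call step() once per epoch, after that epoch's optimizer updates
for epoch in range(num_epochs):
    train_one_epoch(...)   # placeholder for your own loop, which calls cos_opt.step()
    cos_sched.step()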

OneCycleLR — fast training favorite·python

import torch
import torch.optim as optim
from torch.optim.lr_scheduler import OneCycleLR

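# Assumes model, criterion, and train_loader are defined elsewhere.
steps_per_epoch = len(train_loader)
# OneCycleLR overrides this lr: training starts at max_lr / div_factor (default 25).
opt = optim.AdamW(model.parameters(), lr=1e-4)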

scheduler = OneCycleLR(
    opt,
    max_lr=1e-3,                       # peak LR
    steps_per_epoch=steps_per_epoch,
    epochs=10,
    pct_start=0.1,                     # warmup is the first 10% of training
    anneal_strategy='cos',
)

# Per-STEP scheduler — call inside the batch loop
for epoch in range(10):
    for x, y in train_loader:
        opt.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        opt.step()
        scheduler.step()                # per batch
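
The numbers OneCycleLR derives from these arguments can be worked out by hand. This sketch assumes PyTorch's documented defaults `div_factor=25` and `final_div_factor=1e4` and a made-up `steps_per_epoch = 500`. Note that the starting LR comes from `max_lr / div_factor`, not from the `lr` passed to the optimizer.

```python
# Arithmetic only; no PyTorch needed.
max_lr = 1e-3
epochs = 10
steps_per_epoch = 500          # hypothetical len(train_loader)
pct_start = 0.1
div_factor = 25.0              # OneCycleLR default
final_div_factor = 1e4         # OneCycleLR default

total_steps = epochs * steps_per_epoch
warmup_steps = int(pct_start * total_steps)   # approximate length of the rising phase
initial_lr = max_lr / div_factor              # LR at step 0
min_lr = initial_lr / final_div_factor        # LR at the final step

print(f"total_steps={total_steps}, warmup_steps~{warmup_steps}")
print(f"initial_lr={initial_lr:.1e}, max_lr={max_lr:.1e}, min_lr={min_lr:.1e}")
```

Because the schedule is sized to exactly `epochs * steps_per_epoch` calls, the loop must call scheduler.step() once per batch and no more than that in total.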

Warmup + cosine — Transformer recipe·python

import math
import torch.optim as optim
from torch.optim.lr_scheduler import LambdaLR

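# Assumes model, criterion, and train_loader are defined elsewhere.
opt = optim.AdamW(model.parameters(), lr=1e-4)   # lr here is the peak (target) LR
warmup_steps = 1000
total_steps = 50_000   # should match the number of batches actually trained on
num_epochs = total_steps // len(train_loader)

def lr_lambda(step):
    # Returns a multiplier applied to the optimizer's base LR.
    if step < warmup_steps:
        return step / warmup_steps                      # linear ramp 0 → 1
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)                       # hold at 0 if training runs past total_steps
    return 0.5 * (1.0 + math.cos(math.pi * progress))   # cosine decay 1 → 0

scheduler = LambdaLR(opt, lr_lambda)

# Per-step: call scheduler.step() once per batch
for epoch in range(num_epochs):
    for x, y in train_loader:
        opt.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        opt.step()
        scheduler.step()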
