Learning rate — warmup + cosine decay

The learning rate schedule, meaning how the LR changes over the course of training, is critical to stable and efficient training. Modern LLMs almost universally use warmup followed by cosine decay.

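Why warmup

At the start of training the weights are random and the gradients are noisy. A high LR amplifies that noise and can permanently damage the optimization trajectory. Ramping the LR linearly from 0 to the peak over roughly the first 1-5% of training steps lets the optimizer find a stable region first and only then move quickly.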
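As a rough illustration of the linear ramp, the sketch below derives a warmup length from a fraction of the total step count and evaluates the ramp at a few steps. The helper names `warmup_steps_from_fraction` and `linear_warmup`, the 2% fraction, and the 100,000-step budget are assumptions chosen for the example, not part of any particular recipe.

```python
import math

def warmup_steps_from_fraction(total_steps, fraction=0.02):
    """Hypothetical helper: warmup length as a fraction of total steps."""
    return max(1, round(total_steps * fraction))

def linear_warmup(step, warmup_steps, max_lr=3e-4):
    """Linear ramp from 0 at step 0 to max_lr at warmup_steps, then flat."""
    return max_lr * min(1.0, step / warmup_steps)

total_steps = 100_000  # assumed training budget
warmup = warmup_steps_from_fraction(total_steps)  # 2% of the run

for step in (0, warmup // 2, warmup):
    print(f"step={step:5d}  lr={linear_warmup(step, warmup):.2e}")
```

Note that the ramp yields an LR of exactly 0 at step 0, so the very first update does not move the weights.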

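Why cosine decay

Once warmup ends you want to stay near the peak LR for as long as possible, but as the model approaches a local optimum the LR has to come down gradually. Cosine decay, lr(t) = min_lr + 0.5 × (max_lr - min_lr) × (1 + cos(π × progress)), where progress runs from 0 at the end of warmup to 1 at the final step, does this smoothly and with no extra shape parameters to tune. Empirically it works about as well as more elaborate schedules (e.g., linear-then-flat).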
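To see the shape of the cosine curve, the following sketch evaluates the formula at five evenly spaced points. Here `progress` is taken to be the fraction of the post-warmup phase already completed (0 to 1), and the `max_lr` and `min_lr` defaults are illustrative values only.

```python
import math

def cosine_decay(progress, max_lr=3e-4, min_lr=3e-5):
    """Cosine decay for progress in [0, 1] (0 = end of warmup, 1 = last step)."""
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))

# Evaluate the curve at five evenly spaced points.
for p in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(f"progress={p:.2f}  lr={cosine_decay(p):.3e}")
```

With these values the LR is still about 2.6e-4 a quarter of the way through, so the first quarter of the decay covers only around 15% of the drop from max_lr to min_lr. The curve falls fastest in the middle and flattens again near min_lr.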

The original 2017 Transformer used a different schedule (the Noam scheduler: lr ∝ d_model^(-0.5) × min(step^(-0.5), step × warmup^(-1.5))). Modern open recipes (Llama, Mistral, nanoGPT) have moved to warmup + cosine.
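For comparison, the Noam formula can be sketched in a few lines. The proportionality constant is assumed to be 1, the function name `noam_lr` is made up for this example, and the `d_model` and `warmup_steps` defaults are illustrative.

```python
def noam_lr(step, d_model=512, warmup_steps=4000):
    """Noam schedule: linear warmup, then inverse-square-root decay."""
    step = max(1, step)  # step ** -0.5 is undefined at step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

# The two branches of min() meet at step == warmup_steps, which is the peak.
for step in (1000, 4000, 16000):
    print(f"step={step:6d}  lr={noam_lr(step):.3e}")
```

Unlike cosine decay, this schedule has no fixed endpoint: after the peak the LR keeps shrinking in proportion to step^(-0.5), and the peak value itself is tied to d_model and the warmup length rather than set directly.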

Code

Warmup + cosine decay schedule (Python)

import math

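def lr_schedule(step, total_steps, warmup_steps=2000,
                max_lr=3e-4, min_lr=3e-5):
    """Return the learning rate for a given training step."""
    # Linear warmup: ramp from 0 at step 0 up to max_lr at warmup_steps.
    if step < warmup_steps:
        return max_lr * step / warmup_steps
    # Fraction of the post-warmup phase completed, clamped to [0, 1].
    # max(1, ...) avoids division by zero when total_steps == warmup_steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, max(0.0, progress))
    # Cosine decay from max_lr (progress = 0) down to min_lr (progress = 1).
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))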

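# Plug into PyTorch. LambdaLR multiplies the optimizer's base LR by the
# lambda's return value, so create the optimizer with lr=max_lr and
# divide by the same max_lr here:
# scheduler = torch.optim.lr_scheduler.LambdaLR(
#     optimizer,
#     lambda step: lr_schedule(step, total_steps) / max_lr
# )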
