Optimizer

What the optimizer decides

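The loss tells you *how wrong you are*; the optimizer decides *what to do about it*. Given the gradient, it picks the size and direction of every weight update. That's why the same model on the same data can converge cleanly or wander forever: the gradient is only the raw signal, and the optimizer is the policy.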
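To make the "gradient is the signal, optimizer is the policy" idea concrete, here's a minimal plain-Python sketch. It is not how Keras implements its optimizers: `sgd_step` and `adam_step` are hypothetical hand-written helpers following the textbook update rules, and the hyperparameter values are illustrative assumptions. Both rules get the same two gradients, one tiny and one huge.

```python
import math

def sgd_step(w, grad, lr=0.1):
    """Plain SGD: the move scales directly with the gradient."""
    return w - lr * grad

def adam_step(w, grad, state, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    """Textbook Adam: the move is normalized by a running gradient scale."""
    state["t"] += 1
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad * grad
    m_hat = state["m"] / (1 - beta1 ** state["t"])  # bias-corrected mean
    v_hat = state["v"] / (1 - beta2 ** state["t"])  # bias-corrected scale
    return w - lr * m_hat / (math.sqrt(v_hat) + eps)

# Same raw signal for both rules, starting from w = 1.0 each time.
w_sgd_small = sgd_step(1.0, 0.001)
w_sgd_large = sgd_step(1.0, 100.0)
w_adam_small = adam_step(1.0, 0.001, {"t": 0, "m": 0.0, "v": 0.0})
w_adam_large = adam_step(1.0, 100.0, {"t": 0, "m": 0.0, "v": 0.0})

print("SGD moves: ", 1.0 - w_sgd_small, 1.0 - w_sgd_large)
print("Adam moves:", 1.0 - w_adam_small, 1.0 - w_adam_large)
```

SGD's two moves differ by five orders of magnitude, while Adam's first step comes out close to `lr` in both cases. Same gradients, very different policies.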

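Optimizer	Traits	When to use
`Adam`	adaptive learning rate	the default starting point
`AdamW`	Adam with decoupled weight decay	fine-tuning pretrained models
`SGD`	simple, plus momentum	still strong, pairs well with schedules
`RMSprop`	per-parameter adaptive rates	RNNs, non-stationary objectives
`Muon`	orthogonalized momentum updates (Newton-Schulz iteration)	new in Keras 3.10, experimental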

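When in doubt, just use Adam. It's the default for a reason: it's the most forgiving when you only roughly guess the learning rate. But the moment you fine-tune a pretrained model, switch to AdamW. Its decoupled weight decay keeps those large pretrained weights from drifting.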
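What "decoupled" means is easier to see in code. The following is a simplified single-weight sketch, not Keras's implementation. It only models the very first Adam step, where the bias-corrected update reduces to `lr * grad / (|grad| + eps)`. The helper names and the `lr` and `wd` values are hypothetical.

```python
import math

def adam_direction(grad, lr=1e-3, eps=1e-8):
    """First bias-corrected Adam step: lr * grad / (|grad| + eps)."""
    return lr * grad / (math.sqrt(grad * grad) + eps)

def adam_l2_step(w, grad, lr=1e-3, wd=0.01):
    """L2 style: decay is folded into the gradient, then normalized by Adam."""
    return w - adam_direction(grad + wd * w, lr)

def adamw_step(w, grad, lr=1e-3, wd=0.01):
    """Decoupled: Adam step on the raw gradient, decay applied to w directly."""
    return w - adam_direction(grad, lr) - lr * wd * w

grad = 0.0  # the loss is momentarily indifferent to these weights
l2_shrink_big = 5.0 - adam_l2_step(5.0, grad)
l2_shrink_small = 0.5 - adam_l2_step(0.5, grad)
adamw_shrink_big = 5.0 - adamw_step(5.0, grad)
adamw_shrink_small = 0.5 - adamw_step(0.5, grad)

print("L2 in gradient:", l2_shrink_big, l2_shrink_small)
print("decoupled:     ", adamw_shrink_big, adamw_shrink_small)
```

With the decay folded into the gradient, both weights shrink by roughly `lr` regardless of their size, because Adam's normalization rescales the decay term. With the decoupled rule the shrinkage is `lr * wd * w`, so it scales with the weight and doesn't depend on the gradient statistics.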

The learning rate is a curve, not a number

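A fixed learning rate is a compromise: too high to finish cleanly, too low to start fast. A schedule resolves that tension. The Code section uses cosine decay with warmup: it ramps up gently over the warmup steps (1,000 in the example) so the noisy early gradients don't blow the weights away, then glides down to a floor so the final epochs can settle into a minimum instead of bouncing around it. Pass the schedule object straight to the optimizer's learning_rate and Keras advances it on every step. If training plateaus, the ReduceLROnPlateau callback can also cut the rate automatically. Treat it as an alternative rather than an add-on, though: the callback has to set the optimizer's learning rate itself, which Keras doesn't allow when the optimizer was built with a schedule object, so it needs a plain float learning_rate.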
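The plateau rule itself is simple enough to sketch in plain Python. This is a simplified illustration of the idea behind ReduceLROnPlateau, not the callback's actual code: it leaves out details such as cooldown and a minimum improvement threshold, and `plateau_lrs` along with its default values is a hypothetical helper.

```python
def plateau_lrs(val_losses, lr=1e-3, factor=0.1, patience=2, min_lr=1e-6):
    """Return the learning rate in effect after each epoch's validation loss."""
    best = float("inf")
    wait = 0
    history = []
    for loss in val_losses:
        if loss < best:
            best, wait = loss, 0      # improvement: reset the counter
        else:
            wait += 1
            if wait >= patience:      # stalled for `patience` epochs in a row
                lr = max(lr * factor, min_lr)
                wait = 0
        history.append(lr)
    return history

losses = [0.9, 0.7, 0.6, 0.61, 0.60, 0.55, 0.56, 0.57]
lrs = plateau_lrs(losses)
print(lrs)
```

The rate only drops after the loss has failed to beat its best value for `patience` consecutive epochs, so the curve is driven by the metric instead of by the step count.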

Code

Feeding cosine decay with warmup into Adam · python

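import keras

# Cosine decay with linear warmup
lr_schedule = keras.optimizers.schedules.CosineDecay(
    initial_learning_rate=0.0,  # start of the warmup ramp (must sit below the peak)
    decay_steps=10000,          # cosine decay steps, counted after warmup ends
    alpha=1e-3,                 # floor as a fraction of the peak: 1e-3 * 1e-3 = 1e-6
    warmup_target=1e-3,         # peak rate reached at the end of warmup
    warmup_steps=1000,
)
optimizer = keras.optimizers.Adam(learning_rate=lr_schedule)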
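To see the curve a warmup-plus-cosine schedule traces, here's a plain-Python re-derivation. It's a sketch under two assumptions about how CosineDecay combines its arguments: the cosine phase runs for decay_steps after the warmup ends, and alpha is a fraction of the peak rate, not an absolute value. `warmup_cosine_lr` is a hypothetical helper and its defaults (ramp from 0 to a peak of 1e-3 over 1,000 steps, then decay over 10,000 steps to 1e-6) are illustrative, so check the numbers against the schedule object itself.

```python
import math

def warmup_cosine_lr(step, initial_lr=0.0, peak_lr=1e-3,
                     warmup_steps=1000, decay_steps=10000, alpha=1e-3):
    """Learning rate at `step`: linear warmup, then cosine decay to alpha * peak_lr."""
    if step < warmup_steps:
        # Linear ramp from initial_lr up to peak_lr.
        return initial_lr + (peak_lr - initial_lr) * step / warmup_steps
    # Cosine decay measured from the end of warmup, clamped at decay_steps.
    progress = min(step - warmup_steps, decay_steps) / decay_steps
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return peak_lr * ((1.0 - alpha) * cosine + alpha)

for step in (0, 500, 1000, 6000, 11000, 20000):
    print(step, warmup_cosine_lr(step))
```

The rate climbs linearly to the peak at step 1,000, passes roughly half the peak midway through the decay, and stays pinned at the floor once the decay steps are used up.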
