Weight Decay and Early Stopping

Weight decay in one sentence

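Weight decay nudges every parameter toward zero at each step, which keeps any single weight from dominating and acts as a gentle regularizer. In AdamW it is a separate, decoupled step that subtracts λw (scaled by the learning rate) directly from the parameters. In plain SGD it is mathematically identical to adding an L2 penalty to the loss.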
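
The difference between the two forms can be checked with a few lines of arithmetic. The sketch below uses a single scalar weight and assumes the convention where the decoupled shrink is scaled by the learning rate; the function names are made up for illustration, and the "adaptive" step is a deliberately simplified stand-in for Adam's gradient normalization, not the real optimizer.

```python
def sgd_l2_step(w, grad, lr, lam):
    """Plain SGD on loss + (lam / 2) * w**2: the penalty adds lam * w to the gradient."""
    return w - lr * (grad + lam * w)

def sgd_decoupled_step(w, grad, lr, lam):
    """Plain SGD step, then shrink the weight directly by lr * lam * w."""
    return w - lr * grad - lr * lam * w

def adaptive_l2_step(w, grad, lr, lam, eps=1e-8):
    """Simplified adaptive step: the penalty is folded into the gradient
    and therefore gets rescaled together with it."""
    g = grad + lam * w
    return w - lr * g / (abs(g) + eps)

def adaptive_decoupled_step(w, grad, lr, lam, eps=1e-8):
    """Simplified adaptive step with the decay applied outside the rescaling."""
    return w - lr * grad / (abs(grad) + eps) - lr * lam * w

w, grad, lr, lam = 2.0, 0.5, 0.1, 0.01
l2_w = sgd_l2_step(w, grad, lr, lam)
decoupled_w = sgd_decoupled_step(w, grad, lr, lam)
adaptive_l2_w = adaptive_l2_step(w, grad, lr, lam)
adaptive_decoupled_w = adaptive_decoupled_step(w, grad, lr, lam)
print(l2_w, decoupled_w)                    # same value for plain SGD
print(adaptive_l2_w, adaptive_decoupled_w)  # different once the gradient is rescaled
```

With plain SGD the two updates coincide. Once the gradient is rescaled, the L2 penalty is rescaled with it, while the decoupled form keeps shrinking the weight by a fixed fraction.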

Typical defaults: weight_decay=0.01 for transformers, weight_decay=1e-4 for vision CNNs trained from scratch. Tune in factors of 10. Apply it only to weight matrices, not to biases or LayerNorm parameters.
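
One common way to apply decay only to weight matrices is to split the parameters into two optimizer groups. The helper below is a minimal sketch: split_decay_groups is a hypothetical name, and the rule "fewer than 2 dimensions means bias or normalization parameter" is an assumed heuristic, not a universal one. It uses stand-in objects so it runs without a deep learning framework.

```python
from types import SimpleNamespace

def split_decay_groups(named_params, weight_decay=0.01):
    """Split (name, param) pairs into a decayed and a non-decayed group.

    Assumed heuristic: parameters with fewer than 2 dimensions are biases
    or normalization scales/shifts and receive no weight decay.
    """
    decay, no_decay = [], []
    for name, p in named_params:
        if not getattr(p, "requires_grad", True):
            continue  # frozen parameters are not optimized at all
        (decay if p.ndim >= 2 else no_decay).append(p)
    return [
        {"params": decay, "weight_decay": weight_decay},
        {"params": no_decay, "weight_decay": 0.0},
    ]

# Stand-ins that mimic the `.ndim` attribute of real tensors (illustrative only).
fake_params = [
    ("linear.weight", SimpleNamespace(ndim=2)),
    ("linear.bias", SimpleNamespace(ndim=1)),
    ("norm.weight", SimpleNamespace(ndim=1)),
]
groups = split_decay_groups(fake_params)
print([len(g["params"]) for g in groups])
```

In PyTorch, the same function could presumably be fed model.named_parameters(), with the returned list passed to torch.optim.AdamW(groups, lr=...), which accepts per-group weight_decay values. Note that embedding tables are 2-D and would be decayed under this heuristic; some recipes exclude them as well.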

Tip: Copying weight_decay=0.01 from a transformer recipe into a small MLP can look fine during training yet generalize worse. Tune for your own task; don't cargo-cult defaults from another domain.

Early stopping done right

Track validation loss every epoch. Save the model each time val loss hits a new low. If val loss fails to improve for patience epochs, stop training. The saved best checkpoint is what you ship.

Common values: patience=10 for short training runs, patience=50 for long ones. Add min_delta so that a noisy improvement of 0.0001 does not reset the patience counter.
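
The effect of min_delta is easiest to see on a made-up loss curve. The sketch below is illustrative only: stopping_epoch is a hypothetical helper and the loss values are invented, with one real drop followed by improvements of 0.0001 per epoch.

```python
def stopping_epoch(val_losses, patience, min_delta=0.0):
    """Return the epoch index at which training would stop, or None if it never stops."""
    best = float("inf")
    bad = 0  # consecutive epochs without a meaningful improvement
    for epoch, loss in enumerate(val_losses):
        if loss < best - min_delta:
            best, bad = loss, 0
        else:
            bad += 1
            if bad >= patience:
                return epoch
    return None

# Invented curve: 1.0, 0.5, then 0.4999, 0.4998, ... (tiny gains every epoch).
losses = [1.0, 0.5] + [0.5 - 0.0001 * i for i in range(1, 9)]

no_delta = stopping_epoch(losses, patience=3)                    # tiny gains keep resetting the counter
with_delta = stopping_epoch(losses, patience=3, min_delta=1e-3)  # tiny gains count as "no improvement"
print(no_delta, with_delta)
```

Without min_delta every 0.0001 gain counts as progress and the run never stops on this curve. With min_delta=1e-3 the counter runs out three epochs after the last real drop.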

Interaction

Weight decay and early stopping act as nearly equivalent regularizers in practice: both keep the model from memorizing noise. Use both, but don't expect their effects to add up linearly. With strong augmentation you can use less of each.

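Rule: always save the best-validation checkpoint. Always have early stopping configured, even if you don't expect to use it. The cost is two lines; the cost of forgetting is shipping a model worse than one your training run already produced.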

Code

Early stopping with patience (Python)

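import torch

class EarlyStopper:
    """Tracks validation loss and signals when training should stop."""
    def __init__(self, patience=10, min_delta=1e-4):
        self.patience = patience      # epochs to wait without improvement
        self.min_delta = min_delta    # smallest drop that counts as improvement
        self.best = float("inf")      # lowest accepted val loss so far
        self.bad = 0                  # consecutive epochs without improvement

    def step(self, val_loss):
        """Record one epoch's val loss; return True when training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad = 0
            return False         # don't stop
        else:
            self.bad += 1
            return self.bad >= self.patience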

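# model, train_loader, val_loader, train_one_epoch and evaluate are defined elsewhere.
stopper = EarlyStopper(patience=10)
for epoch in range(1000):
    train_one_epoch(model, train_loader)
    val_loss = evaluate(model, val_loader)
    # Use the same test as stopper.step() so best.pt always matches stopper.best.
    # Comparing against stopper.best alone could overwrite the checkpoint with a
    # worse model after an improvement smaller than min_delta.
    if val_loss < stopper.best - stopper.min_delta:
        torch.save(model.state_dict(), "best.pt")
    if stopper.step(val_loss):
        print(f"Stopping at epoch {epoch}, best val_loss={stopper.best:.4f}")
        break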
