Dropout

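What dropout does

During training, dropout randomly zeroes a fraction p of the activations and scales the survivors by 1/(1-p) so that the expected value is unchanged. During evaluation, dropout is the identity. The net effect is that the model cannot rely on any single neuron and has to spread information across a population of units. This makes it a strong implicit regularizer.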
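The mechanics described above can be sketched in a few lines. This is a minimal illustration, not PyTorch's internal implementation; the function name `inverted_dropout` is hypothetical and chosen only for this example.

```python
import torch

def inverted_dropout(x: torch.Tensor, p: float, training: bool) -> torch.Tensor:
    """Zero each element with probability p and rescale survivors by 1/(1-p)."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    if not training or p == 0.0:
        return x  # identity at evaluation time
    # torch.rand_like samples uniformly from [0, 1), so each entry is kept
    # with probability 1 - p.
    keep_mask = (torch.rand_like(x) >= p).to(x.dtype)
    return x * keep_mask / (1.0 - p)

torch.manual_seed(0)
x = torch.ones(100_000)
y = inverted_dropout(x, p=0.3, training=True)
print(y.mean().item())                  # close to 1.0: expected value is preserved
print((y == 0).float().mean().item())   # close to 0.3: fraction of zeroed entries
```

In practice `nn.Dropout` does this for you and switches behavior with `model.train()` and `model.eval()`.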

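Common rates: p=0.1 in transformers, p=0.2 to 0.5 in the fully connected layers of CNNs, and p=0.0 in convolutional layers (if you really want dropout there, use spatial dropout: nn.Dropout2d in PyTorch, SpatialDropout2D in Keras).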

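Tip: apply dropout after the activation, not before it: nn.Sequential(nn.Linear(...), nn.ReLU(), nn.Dropout(0.2), nn.Linear(...)). Reversing the order is a classic mistake, because it zeroes values before the non-linearity is applied. With ReLU the two orders happen to give the same result, since ReLU maps zero to zero and commutes with positive scaling. With an activation such as sigmoid, however, a zeroed input comes out as 0.5, so the unit is not actually dropped.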
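How much the ordering matters depends on the activation. The following check is a small sketch that applies one shared inverted-dropout mask in both orders; the tensor shapes and the value p=0.2 are arbitrary choices for illustration.

```python
import torch

torch.manual_seed(0)
p = 0.2
z = torch.randn(5, 8)  # pre-activation values
# One shared mask so that both orderings drop exactly the same units.
mask = (torch.rand_like(z) >= p).float() / (1.0 - p)

# ReLU maps 0 to 0 and commutes with positive scaling, so the order is irrelevant.
relu_then_drop = torch.relu(z) * mask
drop_then_relu = torch.relu(z * mask)
print(torch.allclose(relu_then_drop, drop_then_relu))  # True

# Sigmoid maps 0 to 0.5, so dropping first leaves "dropped" units active.
sigmoid_then_drop = torch.sigmoid(z) * mask
drop_then_sigmoid = torch.sigmoid(z * mask)
print(torch.allclose(sigmoid_then_drop, drop_then_sigmoid))  # False
```

Placing dropout after the activation is safe for every activation, which is a good reason to keep it as the default habit.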

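When dropout helps and when it hurts

Dropout was a big win in the dense-MLP era and in the original transformer. Alongside BatchNorm and modern data augmentation it is a smaller win, and in some vision pipelines it interacts badly with BatchNorm and actually hurts. A commonly cited reason is that dropout changes activation variance between training and evaluation, which skews the statistics BatchNorm relies on. As a starting point: on by default for transformers, tuned for CNNs, and off for tabular models that already have other regularization.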

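Variants worth knowing

Spatial dropout (nn.Dropout2d in PyTorch) drops entire feature maps in a CNN, which is useful when adjacent pixels within a feature map are strongly correlated. DropPath, also called stochastic depth, randomly drops entire residual blocks and is used in very deep networks such as ConvNeXt and ViT. Word or token dropout drops entire tokens in NLP and is sometimes used as an alternative to data augmentation.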
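The first two variants can be illustrated briefly. `nn.Dropout2d` is the PyTorch module for channel-wise dropout. The `DropPath` and `ResidualBlock` classes below are hypothetical minimal versions written for this example, not the implementations used by ConvNeXt or ViT codebases.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Spatial dropout: whole channels of an (N, C, H, W) tensor are zeroed together.
spatial = nn.Dropout2d(p=0.5)
spatial.train()
fmap = torch.ones(1, 8, 4, 4)
per_channel = spatial(fmap).sum(dim=(2, 3)).squeeze(0)
print(per_channel)  # every entry is either 0 (dropped) or 16 / (1 - 0.5) = 32

class DropPath(nn.Module):
    """Minimal stochastic depth: zero the whole residual branch per sample."""

    def __init__(self, drop_prob: float = 0.1):
        super().__init__()
        self.drop_prob = drop_prob

    def forward(self, branch: torch.Tensor) -> torch.Tensor:
        if not self.training or self.drop_prob == 0.0:
            return branch
        keep_prob = 1.0 - self.drop_prob
        # One Bernoulli draw per sample, broadcast over the remaining dimensions.
        shape = (branch.shape[0],) + (1,) * (branch.dim() - 1)
        probs = torch.full(shape, keep_prob, dtype=branch.dtype, device=branch.device)
        return branch * torch.bernoulli(probs) / keep_prob

class ResidualBlock(nn.Module):
    """The skip connection always passes through; only the branch is dropped."""

    def __init__(self, dim: int, drop_prob: float):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(dim, dim), nn.GELU())
        self.drop_path = DropPath(drop_prob)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.drop_path(self.body(x))

block = ResidualBlock(dim=16, drop_prob=0.2)
block.train()
print(block(torch.randn(4, 16)).shape)  # torch.Size([4, 16])
```

When a sample's branch is dropped, the block reduces to the identity for that sample, which is why this variant is described as dropping an entire residual block.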

Principle: dropout is a default knob to try, not something to leave switched on by default. Measure with and without it on your own task. Sometimes BatchNorm and augmentation already cover the gap and p=0 wins.
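A with/without comparison can be as simple as the sketch below. Everything here is an assumption made for illustration: the synthetic data, the layer sizes, the optimizer settings, and the helper names `make_mlp` and `val_loss_for`. Substitute your own model, data, and metric.

```python
import torch
import torch.nn as nn

def make_mlp(p: float) -> nn.Sequential:
    # Hypothetical small model; only the dropout rate varies between runs.
    return nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Dropout(p), nn.Linear(64, 2))

def val_loss_for(p, x_train, y_train, x_val, y_val, epochs=50, seed=0) -> float:
    torch.manual_seed(seed)  # same seed for every run, so initial weights match
    model = make_mlp(p)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
    loss_fn = nn.CrossEntropyLoss()
    model.train()
    for _ in range(epochs):
        optimizer.zero_grad()
        loss_fn(model(x_train), y_train).backward()
        optimizer.step()
    model.eval()  # dropout is disabled while measuring
    with torch.no_grad():
        return loss_fn(model(x_val), y_val).item()

# Synthetic placeholder data: a noisy linear rule on the first feature.
torch.manual_seed(0)
x = torch.randn(400, 20)
y = (x[:, 0] + 0.5 * torch.randn(400) > 0).long()
x_train, y_train, x_val, y_val = x[:300], y[:300], x[300:], y[300:]

for p in (0.0, 0.2, 0.5):
    loss = val_loss_for(p, x_train, y_train, x_val, y_val)
    print(f"p={p}: validation loss {loss:.4f}")
```

A single run per setting is noisy. Repeating over several seeds and comparing the spread gives a more trustworthy answer.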

Code

Dropout in the right position (Python)

import torch
import torch.nn as nn

mlp = nn.Sequential(
    nn.Linear(784, 256),
    nn.ReLU(),
    nn.Dropout(p=0.2),       # after ReLU
    nn.Linear(256, 128),
    nn.ReLU(),
    nn.Dropout(p=0.2),       # after ReLU
    nn.Linear(128, 10),      # no dropout on the output logits
)

x = torch.randn(4, 784)      # dummy batch of 4 flattened 28x28 inputs

# Dropout is active only in train mode; eval mode turns it into the identity.
mlp.train(); print(mlp(x))   # noisy outputs (dropout active)
mlp.eval();  print(mlp(x))   # deterministic outputs
