Activation Function

The activation menu, with reasons

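ReLU: max(0, x). Since 2012, the default for hidden layers in CNNs and most MLPs. Cheap, sparse, and its gradient is either 1 or 0, so backprop stays simple. Failure mode: dead neurons (units whose weights and bias have drifted until the pre-activation is negative for every input, so the output and the gradient are both stuck at 0).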

GELU: Gaussian Error Linear Unit, x * Φ(x), where Φ is the standard normal CDF. Smoother than ReLU near 0. The default in Transformers (BERT, GPT, ViT). Slightly more expensive, but tends to train better at scale.

SiLU / Swish: x * sigmoid(x). Used in EfficientNet, LLaMA (inside its SwiGLU feed-forward block), and modern image-generation models. Smooth like GELU and slightly cheaper.
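
The three hidden-layer activations above follow directly from their formulas. Here is a minimal plain-Python sketch (standard library only, scalar inputs; the function names are illustrative, not a library API) that evaluates them side by side:

```python
import math

def relu(x: float) -> float:
    """max(0, x)."""
    return max(0.0, x)

def gelu(x: float) -> float:
    """x * Phi(x), with Phi the standard normal CDF written via erf."""
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def silu(x: float) -> float:
    """x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))

# Compare the three on a few sample inputs.
for x in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(f"x={x:+.1f}  relu={relu(x):+.4f}  gelu={gelu(x):+.4f}  silu={silu(x):+.4f}")
```

For negative inputs ReLU is exactly 0, while GELU and SiLU dip slightly below 0 before flattening out. That small negative region is what makes them smooth around the origin.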

Sigmoid: squashes to (0, 1). Fine as the output activation for binary classification (paired with BCE loss). Bad as a hidden activation in deep networks: the gradient saturates and training stalls.

Tanh: squashes to (-1, 1), zero-centered. Used inside RNNs (LSTM, GRU). In deep stacks it has the same saturation problem as sigmoid.
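
Saturation is easy to see from the derivatives. A small sketch in plain Python (helper names are illustrative), using sigmoid'(x) = sigmoid(x) * (1 - sigmoid(x)) and tanh'(x) = 1 - tanh(x)^2:

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x: float) -> float:
    """Derivative of sigmoid at x."""
    s = sigmoid(x)
    return s * (1.0 - s)

def tanh_grad(x: float) -> float:
    """Derivative of tanh at x."""
    return 1.0 - math.tanh(x) ** 2

# The further the input sits in the tail, the smaller the gradient.
for x in (0.0, 2.0, 5.0, 10.0):
    print(f"x={x:4.1f}  sigmoid'={sigmoid_grad(x):.2e}  tanh'={tanh_grad(x):.2e}")
```

The sigmoid derivative peaks at 0.25 (at x = 0) and the tanh derivative at 1. Both fall toward 0 quickly as |x| grows, which is the saturation described above.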

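Softmax: turns a logit vector into a probability distribution. Almost always the last step right before cross-entropy loss. PyTorch's CrossEntropyLoss already applies a numerically stable log-softmax internally, so pass it raw logits and don't stack a softmax in front.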
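
To see why the loss wants raw logits, here is a sketch of a numerically stable softmax and the matching cross-entropy in plain Python. It illustrates the log-sum-exp trick only; it is not PyTorch's actual implementation, and the function names are made up for this example.

```python
import math

def softmax(logits):
    """Softmax with the max subtracted first, so exp() never overflows."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target):
    """-log softmax(logits)[target], computed in log space from the logits."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]

logits = [1000.0, 1001.0, 1002.0]  # a naive exp(1000.0) would overflow
probs = softmax(logits)
print(probs, sum(probs))
print(cross_entropy(logits, target=2))
```

Because the loss works in log space directly from the logits, applying softmax first and then passing the result to a logit-based loss would normalize twice and distort the gradients.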

Tip: default to ReLU for hidden layers (GELU in a transformer). Sigmoid only for binary outputs. Softmax only for the final classifier. As a rule of thumb, these three rules cover about 90% of architectures.

Why the activation determines trainability

The activation controls gradient flow. ReLU passes a gradient of 1 wherever it is active (0 elsewhere), which keeps deep stacks trainable. Sigmoid and tanh saturate in their tails and kill the gradient. A poor activation choice is one of the most common reasons a textbook architecture learns nothing on a real dataset.
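
A toy calculation makes the gradient-flow argument concrete. The sketch below (plain Python, a hypothetical chain of scalar layers with every weight fixed at 1, so only the activation matters) multiplies the local derivatives through 20 layers, which is what backprop does via the chain rule:

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def chain_gradient(act, act_grad, x: float, depth: int) -> float:
    """d(output)/d(input) when h <- act(h) is applied `depth` times."""
    grad, h = 1.0, x
    for _ in range(depth):
        grad *= act_grad(h)  # local derivative at this layer's input
        h = act(h)           # forward to the next layer
    return grad

relu_grad_20 = chain_gradient(
    lambda h: max(0.0, h), lambda h: 1.0 if h > 0 else 0.0, x=0.5, depth=20
)
sigmoid_grad_20 = chain_gradient(
    sigmoid, lambda h: sigmoid(h) * (1.0 - sigmoid(h)), x=0.5, depth=20
)
print(f"ReLU    gradient through 20 layers: {relu_grad_20:.3e}")
print(f"Sigmoid gradient through 20 layers: {sigmoid_grad_20:.3e}")
```

Each sigmoid layer contributes a factor of at most 0.25, so the product shrinks geometrically with depth, while an active ReLU path contributes factors of exactly 1. Real networks have weights that can partly compensate, so this shows the trend rather than a prediction for any specific model.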

Principle: if a network refuses to train, the activation choice is the first thing to check after the data loader. Swap tanh for ReLU and see what happens.
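
One way to run that experiment is to make the activation a constructor argument, so swapping it is a one-word change. A hedged sketch in PyTorch follows; `make_mlp` and `first_layer_grad_norm`, the layer sizes, and the random data are all illustrative assumptions, not part of this lesson's code.

```python
import torch
import torch.nn as nn

def make_mlp(activation: type, depth: int = 20, width: int = 64) -> nn.Sequential:
    """Hypothetical helper: an MLP whose hidden activation is swappable."""
    layers = [nn.Linear(32, width), activation()]
    for _ in range(depth - 1):
        layers += [nn.Linear(width, width), activation()]
    layers.append(nn.Linear(width, 10))  # raw logits, no softmax
    return nn.Sequential(*layers)

def first_layer_grad_norm(activation: type) -> float:
    """Gradient norm reaching the first layer after one backward pass."""
    torch.manual_seed(0)  # same init seed and same fake batch per activation
    model = make_mlp(activation)
    x = torch.randn(128, 32)
    y = torch.randint(0, 10, (128,))
    loss = nn.CrossEntropyLoss()(model(x), y)
    loss.backward()
    return model[0].weight.grad.norm().item()

grad_norms = {act.__name__: first_layer_grad_norm(act) for act in (nn.Tanh, nn.ReLU)}
for name, g in grad_norms.items():
    print(f"{name:5s} first-layer grad norm: {g:.3e}")
```

Comparing the first-layer gradient norm across activations is a quick diagnostic. Which one comes out ahead depends on depth, width, and initialization, so treat the numbers as a starting point for debugging rather than a verdict.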

Code

A tour of activations·python

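import torch
import torch.nn.functional as F
import matplotlib.pyplot as plt

# Evaluate every activation on the same grid so the curves are comparable.
x = torch.linspace(-5, 5, 200)
acts = {
    "ReLU":    F.relu(x),
    "GELU":    F.gelu(x),
    "SiLU":    F.silu(x),
    "Sigmoid": torch.sigmoid(x),
    "Tanh":    torch.tanh(x),
}
for name, y in acts.items():
    # Convert to NumPy explicitly before handing the data to matplotlib.
    plt.plot(x.numpy(), y.numpy(), label=name)
plt.xlabel("x"); plt.ylabel("activation(x)")
plt.legend(); plt.grid(True); plt.show()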

Softmax goes at the very end·python

import torch, torch.nn as nn
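logits = torch.randn(4, 10)              # batch of 4 samples, 10 classes
probs  = torch.softmax(logits, dim=-1)   # normalize over the class dimension
print(probs.sum(dim=-1))  # each row sums to 1 (up to float rounding)

# DO NOT do softmax + CrossEntropyLoss. The loss already applies log-softmax to the logits.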
loss = nn.CrossEntropyLoss()(logits, torch.tensor([3, 0, 9, 5]))
print(loss.item())
