Activations: ReLU, GELU, SiLU, and friends

Non-linearity is what makes a network powerful

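Without a non-linear activation between layers, a stack of linear layers composes into one bigger linear map: no matter how many nn.Linear calls you chain, the result is equivalent to a single trainable matrix (plus a bias). Activations are what let a network learn arbitrary functions.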
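The collapse described above can be checked numerically. The following is a minimal sketch, assuming two small nn.Linear layers with arbitrarily chosen sizes (8, 16, 4); the names f1, f2, W, and b are illustrative only.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Two linear layers with no activation in between (sizes are arbitrary).
f1 = nn.Linear(8, 16)
f2 = nn.Linear(16, 4)
x = torch.randn(5, 8)

with torch.no_grad():
    # Fold both layers into one affine map: W = W2 @ W1, b = W2 @ b1 + b2.
    W = f2.weight @ f1.weight            # shape (4, 8)
    b = f2.weight @ f1.bias + f2.bias    # shape (4,)

    stacked = f2(f1(x))                  # two layers, no activation
    collapsed = x @ W.T + b              # one equivalent affine map
    same_without_activation = torch.allclose(stacked, collapsed, atol=1e-5)

    # With a ReLU in between, the single affine map no longer matches.
    with_relu = f2(torch.relu(f1(x)))
    same_with_activation = torch.allclose(with_relu, collapsed, atol=1e-5)

print(same_without_activation, same_with_activation)
```

The first flag should come out true and the second false: once a ReLU sits between the layers, no single matrix and bias reproduces the output for all inputs.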

The shortlist you will actually use

ReLU: max(0, x). The hidden-layer default for CNNs and pre-Transformer MLPs. Cheap and effective, but if too many neurons get stuck outputting 0 permanently you hit the 'dying ReLU' problem.
GELU: Gaussian Error Linear Unit. The default in modern Transformers (BERT, GPT, ViT). Smooth and slightly slower than ReLU, but empirically works better in attention-based models.
SiLU (a.k.a. Swish): x * sigmoid(x). Standard in modern ConvNets (EfficientNet) and used in some recent vision models. Also used in the FFN gating (SwiGLU) of the LLaMA family.
Sigmoid: squashes to (0, 1). Used for binary classification outputs and for gating mechanisms (LSTM gates, gated MLPs). Do not use it in deep hidden layers, because its gradient saturates.
Tanh: squashes to (-1, 1). Mostly historical; LSTMs still use it. Same saturation problem as sigmoid.
Softmax: turns logits into a probability distribution. Almost always applied over the last dim. Used as the final activation of a classifier and as the score normalizer in attention.
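The formulas in the shortlist can be written out by hand and compared against PyTorch's built-ins. This is a small sketch, assuming the default (erf-based) form of F.gelu; the variable names are illustrative. The last lines also show the saturation mentioned for sigmoid: its gradient is 0.25 at x = 0 and nearly zero at x = 10.

```python
import math
import torch
import torch.nn.functional as F

x = torch.linspace(-6.0, 6.0, steps=25)

# Closed-form definitions, written out by hand.
relu_manual = torch.clamp(x, min=0.0)                            # max(0, x)
silu_manual = x * torch.sigmoid(x)                               # x * sigmoid(x)
gelu_manual = 0.5 * x * (1.0 + torch.erf(x / math.sqrt(2.0)))    # x * Phi(x), erf form

relu_ok = torch.allclose(relu_manual, F.relu(x))
silu_ok = torch.allclose(silu_manual, F.silu(x), atol=1e-6)
gelu_ok = torch.allclose(gelu_manual, F.gelu(x), atol=1e-6)

# Saturation: d/dx sigmoid(x) = s(x) * (1 - s(x)), which shrinks as |x| grows.
z = torch.tensor([0.0, 10.0], requires_grad=True)
torch.sigmoid(z).sum().backward()
grad_at_0, grad_at_10 = z.grad.tolist()

print(relu_ok, silu_ok, gelu_ok)
print(grad_at_0, grad_at_10)
```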

Module form vs functional form

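Most activations come in two flavors: nn.ReLU() (Module) and F.relu(x) (function). The activations listed here are stateless, so the two forms are functionally equivalent. Put the Module form in __init__ when you want the activation to show up in print(model) or model.modules(); use the functional form for one-off calls inside forward().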
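To make the visibility difference concrete, here is a minimal sketch using a hypothetical TinyMLP class (not from the original lesson) that uses the Module form for one activation and the functional form for another.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMLP(nn.Module):  # hypothetical model, for illustration only
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(8, 16)
        self.act = nn.GELU()          # Module form: registered as a submodule
        self.fc2 = nn.Linear(16, 16)
        self.fc3 = nn.Linear(16, 2)

    def forward(self, x):
        x = self.act(self.fc1(x))
        x = F.silu(self.fc2(x))       # functional form: not registered anywhere
        return self.fc3(x)

model = TinyMLP()
module_types = [type(m).__name__ for m in model.modules()]
out = model(torch.randn(4, 8))

print(model)          # lists GELU, but no SiLU entry
print(module_types)
```

Both activations run in forward(), but only the GELU declared in __init__ appears when the model is printed or walked with model.modules().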

Code

The shortlist in code·python

import torch
import torch.nn as nn
import torch.nn.functional as F

x = torch.randn(4, 8)

# Module form
relu = nn.ReLU()(x)
gelu = nn.GELU()(x)
silu = nn.SiLU()(x)
sigmoid = nn.Sigmoid()(x)
tanh = nn.Tanh()(x)

# Functional form — same result, no module to declare
relu_f = F.relu(x)
gelu_f = F.gelu(x)
silu_f = F.silu(x)

# Softmax — almost always on the last dim
probs = F.softmax(x, dim=-1)
print(probs.sum(-1))   # each of the 4 rows sums to 1 (up to float rounding)

Choosing an activation by task·python

import torch.nn as nn

# Vision CNN — ReLU is fine, ConvNeXt-style uses GELU
cnn_block = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.ReLU())

# Transformer block — GELU is the modern default
xfm_ffn = nn.Sequential(nn.Linear(512, 2048), nn.GELU(), nn.Linear(2048, 512))

# Modern ConvNet (EfficientNet flavor) — SiLU
modern_cnn = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.SiLU())

# Binary classifier output — sigmoid
binary_head = nn.Sequential(nn.Linear(512, 1), nn.Sigmoid())

Softmax: last dim, and the log-stable variant·python

import torch
import torch.nn.functional as F

logits = torch.randn(4, 10)            # batch=4, classes=10

# Probabilities — for inspection / sampling
probs = F.softmax(logits, dim=-1)
assert torch.allclose(probs.sum(dim=-1), torch.ones(4))

# log-softmax — numerically stable for log-likelihood losses
log_probs = F.log_softmax(logits, dim=-1)

# F.nll_loss expects log-probabilities; F.cross_entropy expects raw logits
# (it does log_softmax internally — that's why CE doesn't need a softmax layer
#  in your network)
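The two claims in the comments above can be checked directly. This is a hedged sketch: the random targets and the extreme logit pair [0, 200] are made-up values chosen only to trigger float32 underflow.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(4, 10)                 # batch=4, classes=10
targets = torch.randint(0, 10, (4,))        # made-up class labels

# cross_entropy on raw logits matches nll_loss on log-probabilities.
ce = F.cross_entropy(logits, targets)
nll = F.nll_loss(F.log_softmax(logits, dim=-1), targets)
losses_match = torch.allclose(ce, nll)

# Stability: with a huge gap between logits, softmax underflows to exactly 0
# for the small entry, so log(softmax) yields -inf. log_softmax stays finite.
extreme = torch.tensor([[0.0, 200.0]])
naive = torch.log(F.softmax(extreme, dim=-1))
stable = F.log_softmax(extreme, dim=-1)

print(losses_match)
print(naive)
print(stable)
```

This is why the network itself usually ends in raw logits: passing them to F.cross_entropy avoids ever taking the log of an underflowed probability.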
