Activation functions: ReLU → GELU → SwiGLU

The nonlinear activation inside the FFN has evolved through three generations.

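ReLU (original Transformer)

max(0, x). Simple and fast. Its weakness is the "dead neuron" problem: a unit whose pre-activation is pushed negative for every input gets a gradient of exactly 0, so it typically cannot recover. Used by the 2017 Transformer and many early models; modern LLMs rarely use it.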
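A minimal sketch of the dead-neuron effect, assuming PyTorch is installed. The bias value of -100 and the tensor shapes are arbitrary choices made only to force the unit into the negative region; they are not from any real model.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# One ReLU unit with 4 inputs (hypothetical sizes, for illustration only)
unit = nn.Linear(4, 1)
with torch.no_grad():
    unit.bias.fill_(-100.0)  # push the pre-activation far into the negative region

x = torch.randn(32, 4)        # a batch of 32 random inputs
out = torch.relu(unit(x))     # every pre-activation is negative, so every output is 0
out.sum().backward()

# With a zero output on every input, no gradient reaches the weights or the bias
dead_grad_is_zero = bool((unit.weight.grad == 0).all()) and bool((unit.bias.grad == 0).all())
print("max output:", float(out.max()))
print("all gradients exactly zero:", dead_grad_is_zero)
```

Since the gradient is exactly zero on every example in the batch, a gradient step leaves this unit's parameters unchanged, which is why it tends to stay dead.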

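GELU (GPT-2/3, BERT)

x · Φ(x), where Φ is the standard normal CDF. A smooth approximation of ReLU whose gradient is nonzero almost everywhere, including for negative inputs, so units are far less prone to dying. Empirically it tends to perform better. Used by GPT-2, GPT-3, BERT, and RoBERTa.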
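To make the gradient claim concrete, here is a small sketch in plain Python using the exact definition x · Φ(x). The helper names (gelu, gelu_grad, relu_grad) are illustrative, not from any library. The derivative follows from the product rule: Φ(x) + x · φ(x), where φ is the standard normal PDF.

```python
import math

def gelu(x):
    # Exact GELU: x * Phi(x), with Phi the standard normal CDF
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_grad(x):
    # d/dx [x * Phi(x)] = Phi(x) + x * phi(x), with phi the standard normal PDF
    cdf = 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
    pdf = math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)
    return cdf + x * pdf

def relu_grad(x):
    # ReLU gradient: 1 on the positive side, exactly 0 otherwise
    return 1.0 if x > 0 else 0.0

for x in (-3.0, -1.0, -0.5, 0.0, 1.0, 3.0):
    print(f"x={x:+.1f}  gelu={gelu(x):+.4f}  gelu'={gelu_grad(x):+.4f}  relu'={relu_grad(x):.1f}")
```

On the negative side the GELU gradient is small, and it crosses zero only at an isolated point near x ≈ -0.75, whereas the ReLU gradient is identically zero there.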

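SwiGLU (modern standard)

Combines Swish (a smooth ReLU variant, x · σ(x)) with a multiplicative gate. SwiGLU uses three weight matrices instead of two; the extra projection forms a "gate" that modulates the activations element-wise: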

SwiGLU(x) = (Swish(x · W₁) ⊙ (x · W₃)) · W₂

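The gate gives the model fine-grained control over which features pass through and which are suppressed. To keep the parameter count close to a standard FFN with d_ff = 4 × d_model, SwiGLU models shrink d_ff to about 8/3 × d_model. Llama, Mistral, Mixtral, and Qwen all use SwiGLU, and Gemma uses the closely related GeGLU, so gated FFNs are the norm among modern open-weight LLMs.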
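The 8/3 factor is plain arithmetic: a standard FFN has 2 · d_model · d_ff weights and SwiGLU has 3 · d_model · d_ff, so matching 2 · 4 = 8 needs 3 · d_ff/d_model = 8. A quick check, assuming bias-free layers and an illustrative d_model of 4096 (the function names are made up for this sketch):

```python
def ffn_params(d_model, d_ff):
    # Standard FFN: one up-projection and one down-projection, no biases
    return 2 * d_model * d_ff

def swiglu_params(d_model, d_ff):
    # SwiGLU: two input projections (W1, W3) plus one output projection (W2), no biases
    return 3 * d_model * d_ff

d_model = 4096  # illustrative width, not tied to a specific model

baseline = ffn_params(d_model, 4 * d_model)        # 8 * d_model^2
same_width = swiglu_params(d_model, 4 * d_model)   # 12 * d_model^2
d_ff_matched = round(8 * d_model / 3)              # shrink d_ff to ~8/3 * d_model
matched = swiglu_params(d_model, d_ff_matched)

print("SwiGLU / FFN at the same d_ff:", same_width / baseline)
print("matched d_ff:", d_ff_matched)
print("SwiGLU / FFN at matched d_ff:", matched / baseline)
```

In practice, implementations often round the matched d_ff to a hardware-friendly multiple, so real configurations may differ slightly from the exact 8/3 value.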

Code

SwiGLU implementation·python

import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    def __init__(self, d_model, d_ff):
        super().__init__()
        # Three weight matrices instead of two
        self.W1 = nn.Linear(d_model, d_ff, bias=False)  # input projection passed through Swish
        self.W2 = nn.Linear(d_ff, d_model, bias=False)  # output projection back to d_model
        self.W3 = nn.Linear(d_model, d_ff, bias=False)  # input projection used linearly

    def forward(self, x):
        # F.silu is the same as Swish (x * sigmoid(x))
        return self.W2(F.silu(self.W1(x)) * self.W3(x))

Activation comparison plot·python

import torch
import torch.nn.functional as F
import matplotlib.pyplot as plt

# Evaluate each activation on the same grid of inputs
x = torch.linspace(-3, 3, 200)
plt.plot(x.numpy(), F.relu(x).numpy(), label='ReLU')
plt.plot(x.numpy(), F.gelu(x).numpy(), label='GELU')
plt.plot(x.numpy(), F.silu(x).numpy(), label='Swish (SiLU)')
plt.legend(); plt.grid(); plt.savefig('activations.png')

Exercise

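Train three otherwise identical small Transformers that differ only in the activation function: ReLU, GELU, and SwiGLU. Plot the loss curves on the same axes for the same number of training tokens. Then compare parameter counts (at the same d_ff, the SwiGLU FFN has 50% more). Adjust d_ff so all three have roughly the same parameter count, and rerun.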