Layer normalization — Pre-LN vs Post-LN

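Layer normalization normalizes the activations within a layer (per token, across the feature dimension) to zero mean and unit variance, then applies a learnable scale and shift, which stabilizes training. Where you put it inside the block matters more than you'd expect.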
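To make that definition concrete, here is a minimal sketch that normalizes by hand and compares the result against PyTorch's built-in. The helper name `manual_layer_norm`, the tensor shapes, and `eps=1e-5` are illustrative assumptions, not something this lesson prescribes.

```python
import torch
import torch.nn.functional as F

def manual_layer_norm(x, weight, bias, eps=1e-5):
    # Statistics are taken per token, over the feature (last) dimension.
    mean = x.mean(dim=-1, keepdim=True)
    var = x.var(dim=-1, keepdim=True, unbiased=False)  # biased variance, as LayerNorm uses
    x_hat = (x - mean) / torch.sqrt(var + eps)
    # Learnable per-feature scale and shift.
    return x_hat * weight + bias

torch.manual_seed(0)
x = torch.randn(2, 5, 8) * 3.0 + 1.0   # (batch, seq, d_model), deliberately off-center
weight = torch.randn(8)
bias = torch.randn(8)

ours = manual_layer_norm(x, weight, bias)
ref = F.layer_norm(x, (8,), weight, bias, eps=1e-5)
matches = torch.allclose(ours, ref, atol=1e-4)
```

If `matches` comes out true, the hand-written version agrees with `F.layer_norm` up to floating-point noise.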

Two placements

Post-LN (original Transformer, 2017): normalize after the residual connection.

x = LayerNorm(x + Sublayer(x))

Doesn't train well without learning-rate warmup. In deep stacks the gradient signal explodes or vanishes unless you tune carefully. Most papers from 2017-2019 use Post-LN; modern models avoid it.
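A rough sketch of the Post-LN formula as a module, assuming sublayers that take and return a single tensor. `PostLNBlock` and the linear stand-in sublayers are hypothetical names chosen for illustration; a real block would pass in attention and an FFN.

```python
import torch
import torch.nn as nn

class PostLNBlock(nn.Module):
    """Post-LN residual block: normalize AFTER adding the residual."""
    def __init__(self, d_model, attn, ffn):
        super().__init__()
        self.attn = attn
        self.norm1 = nn.LayerNorm(d_model)
        self.ffn = ffn
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        x = self.norm1(x + self.attn(x))   # residual add, then norm
        x = self.norm2(x + self.ffn(x))    # residual add, then norm
        return x

torch.manual_seed(0)
d_model = 16
# Stand-in sublayers (hypothetical): plain linear layers instead of real attention/FFN.
block = PostLNBlock(d_model, attn=nn.Linear(d_model, d_model), ffn=nn.Linear(d_model, d_model))

x = torch.randn(2, 5, d_model) * 10.0      # large-scale input
out = block(x)
# At initialization LayerNorm has weight=1 and bias=0, so every token leaves the
# block with roughly zero mean and unit variance.
token_mean = out.mean(dim=-1)
token_var = out.var(dim=-1, unbiased=False)
```

The thing to notice: whatever scale goes in, the residual stream itself comes out normalized, so every gradient flowing back has to pass through a LayerNorm.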

Pre-LN (modern standard): normalize before the sublayer.

x = x + Sublayer(LayerNorm(x))

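The residual stream stays unnormalized, so gradients can flow back along the skip connection without passing through a layer norm. Training is much more stable. GPT-2, GPT-3, and the modern Llama and Mistral families are Pre-LN. (BERT-base is not: it follows the original Post-LN layout.)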
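One way to see the skip-path claim, as a toy sketch rather than a proof: zero out the sublayer so only the residual path carries gradient, then backpropagate an arbitrary upstream gradient through each placement. The zeroed `sublayer` and the helper `input_grad` are assumptions made for this illustration; real sublayers add their own gradient terms on top.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model = 8

# Hypothetical sublayer that outputs exactly zero, so only the skip path matters.
sublayer = nn.Linear(d_model, d_model)
nn.init.zeros_(sublayer.weight)
nn.init.zeros_(sublayer.bias)
norm = nn.LayerNorm(d_model)

def pre_ln(x):
    return x + sublayer(norm(x))

def post_ln(x):
    return norm(x + sublayer(x))

def input_grad(block_fn, x, upstream):
    # Backpropagate an arbitrary upstream gradient and return dLoss/dx.
    x = x.detach().clone().requires_grad_(True)
    block_fn(x).backward(upstream)
    return x.grad

x = torch.randn(4, d_model) * 3.0
upstream = torch.randn(4, d_model)

pre_grad = input_grad(pre_ln, x, upstream)
post_grad = input_grad(post_ln, x, upstream)

# Pre-LN: the skip path hands the upstream gradient back unchanged.
pre_is_identity = torch.allclose(pre_grad, upstream)
# Post-LN: the gradient is rescaled and projected by LayerNorm's Jacobian.
post_is_identity = torch.allclose(post_grad, upstream)
```

Stack many Post-LN blocks and that rescaling compounds layer after layer, which is one intuition for why depth hurts there.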

RMSNorm

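Modern Llama and Mistral use RMSNorm instead of LayerNorm. Same shape, but it skips the mean-centering step: instead of (activation − mean) / std, it divides by the root mean square of the activations. Slightly faster, slightly fewer parameters (no shift term), and empirically just as stable.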
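A small numerical check of that relationship, assuming no learnable scale or shift on either side: on inputs that already have zero mean, RMS normalization and LayerNorm should agree, and they should diverge once the features carry an offset. `rms_normalize` is a hypothetical helper written for this sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def rms_normalize(x, eps=1e-6):
    # Divide by the root mean square over the feature dimension; no mean subtraction.
    return x * torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps)

torch.manual_seed(0)
d_model = 8
x = torch.randn(4, d_model) + 2.0                 # features with a clear nonzero mean
x_centered = x - x.mean(dim=-1, keepdim=True)     # same features, mean removed

# Without affine parameters, LayerNorm amounts to "center, then RMS-normalize".
same_when_centered = torch.allclose(
    rms_normalize(x_centered), F.layer_norm(x_centered, (d_model,), eps=1e-6), atol=1e-5
)
differs_when_offset = not torch.allclose(
    rms_normalize(x), F.layer_norm(x, (d_model,), eps=1e-6), atol=1e-5
)

# Parameters per norm layer: LayerNorm has scale + shift, RMSNorm only a scale.
layernorm_params = sum(p.numel() for p in nn.LayerNorm(d_model).parameters())
rmsnorm_params = d_model
```

The parameter saving is one vector of size d_model per norm layer, which is tiny next to the attention and FFN weights; the speedup comes from skipping the mean computation.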

Code

Pre-LN block (modern)·python

import torch.nn as nn

class PreLNBlock(nn.Module):
    # attn and ffn are modules that take one tensor and return one of the same shape
    def __init__(self, d_model, attn, ffn, norm_cls=nn.LayerNorm):
        super().__init__()
        self.norm1 = norm_cls(d_model)
        self.attn = attn
        self.norm2 = norm_cls(d_model)
        self.ffn = ffn
    def forward(self, x):
        x = x + self.attn(self.norm1(x))    # norm BEFORE attention
        x = x + self.ffn(self.norm2(x))      # norm BEFORE FFN
        return x                             # residual stream is never normalized
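The block calls `attn` with a single argument, while `nn.MultiheadAttention` expects query, key, and value and returns a tuple. One possible way to bridge that gap is a thin adapter like the sketch below. `SelfAttention`, `make_ffn`, the head count, and the 4x hidden size are all assumptions for illustration.

```python
import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    """Hypothetical adapter: exposes nn.MultiheadAttention as a one-argument sublayer."""
    def __init__(self, d_model, n_heads):
        super().__init__()
        self.mha = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x):
        # Self-attention: query, key, and value are all x. Drop the attention weights.
        out, _ = self.mha(x, x, x, need_weights=False)
        return out

def make_ffn(d_model, d_hidden):
    # Position-wise feed-forward sublayer; the hidden size is the caller's choice.
    return nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model))

torch.manual_seed(0)
d_model, n_heads = 32, 4
attn = SelfAttention(d_model, n_heads)
ffn = make_ffn(d_model, 4 * d_model)
norm1, norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

x = torch.randn(2, 10, d_model)            # (batch, seq, d_model)
h = x + attn(norm1(x))                     # same two steps as PreLNBlock.forward
y = h + ffn(norm2(h))
```

With these in hand, `PreLNBlock(d_model, attn, ffn)` should accept the same `(batch, seq, d_model)` input.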

RMSNorm in a few lines·python

import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    def __init__(self, d_model, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(d_model))  # learnable scale, no shift
        self.eps = eps
    def forward(self, x):
        # Skip mean centering; just normalize by RMS.
        # eps goes inside the square root so an all-zero input cannot blow up.
        inv_rms = torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)
        return self.weight * (x * inv_rms)

Exercise

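Implement both a Pre-LN and a Post-LN block. Train a small Transformer (4 layers) on a copy task: with and without learning-rate warmup × the two placements = 4 combinations. Report which combinations actually train stably. (Spoiler: Pre-LN works even without warmup; Post-LN often doesn't.)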
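For the warmup arm of the experiment, a linear ramp is one reasonable starting point. The sketch below uses `torch.optim.lr_scheduler.LambdaLR`; the helper `linear_warmup`, the 10-step ramp, the base rate of 1e-3, and the placeholder model are assumptions, not values the exercise requires.

```python
import torch

def linear_warmup(warmup_steps):
    # Returns a multiplier on the base learning rate: ramps linearly to 1.0, then stays flat.
    def factor(step):
        return min(1.0, (step + 1) / warmup_steps)
    return factor

model = torch.nn.Linear(4, 4)                      # placeholder for the 4-layer Transformer
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=linear_warmup(warmup_steps=10))

lrs = []
for step in range(15):
    lrs.append(scheduler.get_last_lr()[0])         # learning rate in effect at this step
    # forward pass and loss.backward() would go here in a real training loop
    optimizer.step()
    scheduler.step()
```

For the no-warmup arm, drop the scheduler and keep everything else identical so the comparison stays fair.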