Residual connections: why you can stack 80 layers

A residual connection adds a sublayer's input to its output: output = x + Sublayer(x). Behind that simplicity sit three properties that let modern Transformers scale to 80+ layers without training collapse.

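Gradient highway. Backprop passes straight through the residual addition: because output = x + Sublayer(x), the gradient with respect to x is the identity plus the sublayer's Jacobian. Even if Sublayer(x) is badly initialized, gradients can still reach earlier layers along the skip path. This largely removes the vanishing-gradient problem that kept pre-2015 deep networks to a few dozen layers at most.
Identity initialization. If the sublayer weights start out small, x + Sublayer(x) ≈ x. Each block begins as roughly an identity, so adding more blocks doesn't break a working model. Training then nudges the blocks toward useful transformations gradually.
Ensemble interpretation. A network with N residual blocks can be viewed as an ensemble of 2^N paths, one for each combination of "skip" vs "use" decisions. Veit et al. (2016) showed that most of these paths are much shorter than the nominal depth, which is why the architecture is robust to dropping layers.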
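The gradient-highway point can be checked with a toy experiment. This is a minimal sketch, not a Transformer: the helper name input_grad_norm, the width of 16, the depth of 50, and the deliberately tiny init (std 0.01) are all assumptions chosen for illustration. It builds the same stack of Linear + tanh layers twice, once plain and once with skip connections, and measures how much gradient reaches the input.

```python
import torch
from torch import nn


def input_grad_norm(depth: int, use_residual: bool, seed: int = 0) -> float:
    """Norm of d(sum of output)/d(input) for a stack of Linear+tanh layers."""
    torch.manual_seed(seed)
    d = 16  # hypothetical width, small enough to run instantly
    layers = [nn.Linear(d, d) for _ in range(depth)]
    for layer in layers:
        # Deliberately tiny weights: a "badly initialized" sublayer.
        nn.init.normal_(layer.weight, std=0.01)
        nn.init.zeros_(layer.bias)

    x = torch.randn(4, d, requires_grad=True)
    h = x
    for layer in layers:
        out = torch.tanh(layer(h))
        # With the skip path, the layer's Jacobian is (identity + small term).
        h = h + out if use_residual else out

    h.sum().backward()
    return x.grad.norm().item()


plain = input_grad_norm(depth=50, use_residual=False)
resid = input_grad_norm(depth=50, use_residual=True)
print(f"plain stack:    {plain:.3e}")
print(f"residual stack: {resid:.3e}")
```

In the plain stack the gradient is multiplied by a small Jacobian at every layer and shrinks toward zero. In the residual stack each Jacobian stays close to the identity, so the gradient arrives at the input at a usable scale.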

Without residual connections, a Transformer deeper than roughly 6 layers is hard to train. With them, you can stack 32 (Llama 3 8B), 80 (Llama 3.3 70B), or even 126 (Llama 3.1 405B) blocks and still optimize cleanly.

Code

Block with residuals·python

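from torch import nn
from torch.nn import RMSNorm  # available in PyTorch >= 2.4


class TransformerBlock(nn.Module):
    """Pre-norm block: each sublayer reads a normalized copy of x."""

    def __init__(self, d_model, attn, ffn, norm_cls=RMSNorm):
        super().__init__()
        self.norm1 = norm_cls(d_model)  # norm applied before attention
        self.attn = attn  # any module mapping (..., d_model) -> (..., d_model)
        self.norm2 = norm_cls(d_model)  # norm applied before the feed-forward
        self.ffn = ffn

    def forward(self, x):
        # Each sublayer's output is added back onto the residual stream
        x = x + self.attn(self.norm1(x))
        x = x + self.ffn(self.norm2(x))
        return x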
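
As a usage sketch for the identity-initialization property, the snippet below stacks 80 blocks and checks that the stack leaves its input untouched at init. Several things here are assumptions for illustration only: the block is repeated so the snippet runs on its own, nn.LayerNorm stands in for RMSNorm, plain Linear layers stand in for real attention, and the helper make_block zeroes each sublayer's output projection, which is one way to get Sublayer(x) = 0 at the start rather than anything the text prescribes.

```python
import torch
from torch import nn


class TransformerBlock(nn.Module):
    # Same pre-norm block as above, repeated so this snippet is self-contained.
    def __init__(self, d_model, attn, ffn, norm_cls=nn.LayerNorm):
        super().__init__()
        self.norm1 = norm_cls(d_model)
        self.attn = attn
        self.norm2 = norm_cls(d_model)
        self.ffn = ffn

    def forward(self, x):
        x = x + self.attn(self.norm1(x))
        x = x + self.ffn(self.norm2(x))
        return x


def make_block(d_model: int) -> TransformerBlock:
    # Hypothetical stand-ins: a Linear layer instead of real attention.
    attn = nn.Linear(d_model, d_model)
    ffn = nn.Sequential(
        nn.Linear(d_model, 4 * d_model),
        nn.GELU(),
        nn.Linear(4 * d_model, d_model),
    )
    # Zero the last projection of each sublayer so it outputs 0 at init.
    for proj in (attn, ffn[-1]):
        nn.init.zeros_(proj.weight)
        nn.init.zeros_(proj.bias)
    return TransformerBlock(d_model, attn, ffn)


torch.manual_seed(0)
d_model = 32
stack = nn.Sequential(*[make_block(d_model) for _ in range(80)])

x = torch.randn(2, 5, d_model)  # (batch, sequence, d_model)
with torch.no_grad():
    y = stack(x)

max_diff = (y - x).abs().max().item()
print(f"max |stack(x) - x| at init: {max_diff}")
```

Because every sublayer contributes nothing at init, the 80-block stack behaves as an identity map, and training can move each block away from zero without first having to undo a random transformation.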
