Residual Connection

The most important architecture trick of the last 10 years

A residual connection is y = x + f(x): the block's output is its input plus the block's transformation. The 'skip' path lets gradients flow through depth without being multiplied by each block's local Jacobian. It is the reason 100-layer networks (ResNet) and 175-billion-parameter transformers (GPT-3 and beyond) are trainable in practice.
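The definition above fits in a few lines of PyTorch. This is a minimal sketch, not a library API: the `Residual` wrapper name and the small MLP used as f are both hypothetical, chosen only for illustration, and the wrapper assumes f preserves the shape of its input.

```python
import torch
import torch.nn as nn

class Residual(nn.Module):
    """Wrap a shape-preserving module f so the forward pass computes x + f(x)."""

    def __init__(self, fn: nn.Module):
        super().__init__()
        self.fn = fn

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Skip path (x) plus transform path (f(x))
        return x + self.fn(x)

# Hypothetical sublayer: an MLP mapping 64 features back to 64 features
block = Residual(nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 64)))
x = torch.randn(8, 64)
y = block(x)  # same shape as x
```

Because the wrapper only adds, it can go around any sublayer whose output shape matches its input.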

Before residuals, adding layers past roughly 20-30 made training worse. With residuals, extra depth keeps paying off. This is not a regularization trick; it is what makes depth usable.

Tip: if you see x + sublayer(x) in code, that is a residual connection. Spot-check that the shapes match (the skip and the transform must have the same shape). If they don't, the skip needs a projection (nn.Linear, or a strided nn.Conv2d).
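For the fully connected case, the shape check and the projection might look like the sketch below. `ProjectedResidual` and its layer sizes are hypothetical names and values picked for the example. The explicit assert is there because PyTorch broadcasting can silently accept some mismatched shapes (for instance (4, 1) + (4, 64)) instead of raising an error.

```python
import torch
import torch.nn as nn

class ProjectedResidual(nn.Module):
    """Residual block whose skip path is projected when in_dim != out_dim."""

    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.fn = nn.Sequential(
            nn.Linear(in_dim, out_dim), nn.ReLU(), nn.Linear(out_dim, out_dim)
        )
        # Identity when shapes already agree, otherwise a learned linear projection
        if in_dim == out_dim:
            self.skip = nn.Identity()
        else:
            self.skip = nn.Linear(in_dim, out_dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.fn(x)
        identity = self.skip(x)
        # Fail loudly rather than let broadcasting hide a shape bug
        assert out.shape == identity.shape, (out.shape, identity.shape)
        return out + identity

x = torch.randn(4, 32)
y = ProjectedResidual(32, 64)(x)      # skip path is projected from 32 to 64
same = ProjectedResidual(32, 32)(x)   # skip path is the identity
```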

Where residuals show up

Pretty much everywhere in modern architectures. ResNet (CNNs), Transformers (two residuals per block, one around attention and one around the FFN), DenseNet (concatenation instead of addition), U-Net (encoder-to-decoder skips), Mamba/SSM blocks. The pattern cuts across architecture families.
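To make the "two residuals per block" point concrete, here is a rough sketch of a Transformer block. The pre-norm layout (LayerNorm before each sublayer), the GELU activation, and the sizes d_model=64, n_heads=4, d_ff=256 are assumptions for the example; real models differ in these details, and the original Transformer placed the normalization after the addition.

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """Pre-norm Transformer block (assumed layout) with two residual additions."""

    def __init__(self, d_model: int = 64, n_heads: int = 4, d_ff: int = 256):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        x = x + attn_out                   # residual 1: around self-attention
        x = x + self.ffn(self.norm2(x))    # residual 2: around the FFN
        return x

x = torch.randn(2, 10, 64)   # (batch, sequence, d_model)
y = TransformerBlock()(x)    # shape is preserved, so blocks can be stacked
```

Both sublayers map d_model back to d_model, so neither skip needs a projection.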

A bit of math

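If y = x + f(x), then dy/dx = 1 + f'(x). Even when f'(x) is small (the vanishing-gradient regime), the +1 keeps the gradient alive. Across many layers, the chain-rule product of (1 + f'_i) terms expands into a sum of products that always contains a plain 1 (the path that takes every skip), which is numerically far more forgiving than a bare product of small f'_i terms.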
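A scalar toy calculation shows the difference between the two products. The depth of 50 and the per-layer derivative of 0.01 are made-up values for illustration, and `chain_gradient` is a hypothetical helper, not a library function.

```python
def chain_gradient(local_grads, residual):
    """Multiply per-layer derivatives along a scalar chain of layers.

    Plain layer:    dy/dx = f'(x)
    Residual layer: dy/dx = 1 + f'(x)
    """
    total = 1.0
    for g in local_grads:
        total *= (1.0 + g) if residual else g
    return total

depth = 50
local_grads = [0.01] * depth  # assumed: every layer has a small derivative f'(x)

plain = chain_gradient(local_grads, residual=False)  # on the order of 1e-100
resid = chain_gradient(local_grads, residual=True)   # roughly 1.64

print(f"plain stack:    {plain:.3e}")
print(f"residual stack: {resid:.3f}")
```

With the same small local derivatives, the plain product collapses toward zero while the residual product stays on the order of 1.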

Principle: residual connections are the closest thing deep learning has to a 'free' architectural improvement. Any architecture you design or borrow should have residuals from the start, not bolted on later.

Code

ResNet basic block·python

import torch.nn as nn

class BasicBlock(nn.Module):
    """ResNet basic block: two 3x3 convs plus a skip connection."""

    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        # bias=False: each conv is followed by BatchNorm, which has its own shift
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False)
        self.bn1   = nn.BatchNorm2d(out_ch)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, stride=1, padding=1, bias=False)
        self.bn2   = nn.BatchNorm2d(out_ch)
        self.relu  = nn.ReLU(inplace=True)
        # Projection if shapes don't match: a strided 1x1 conv adjusts both
        # the channel count and the spatial size of the skip path
        if stride != 1 or in_ch != out_ch:
            self.skip = nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False),
                nn.BatchNorm2d(out_ch),
            )
        else:
            self.skip = nn.Identity()

    def forward(self, x):
        identity = self.skip(x)                      # skip path (projected if needed)
        out = self.relu(self.bn1(self.conv1(x)))     # transform path f(x), first conv
        out = self.bn2(self.conv2(out))              # second conv, no ReLU yet
        return self.relu(out + identity)             # add first, then the final ReLU
