Linear, Bias, MLP

nn.Linear: the most important layer

nn.Linear(in_features, out_features) applies the affine transform y = x W^T + b. The weight matrix has shape (out_features, in_features) and the bias has shape (out_features,). This one layer type is where the bulk of the parameters in most models live, roughly 90% as a rule of thumb: Transformer FFN blocks, classifier heads, and embedding projections are all Linear.
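
One way to check the formula is to recompute the output by hand from the layer's own weight and bias and compare it with what the layer returns. This is a minimal sketch; the tensor sizes, the seed, and the tolerance are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # arbitrary seed, only for reproducibility

linear = nn.Linear(20, 10)
x = torch.randn(32, 20)

# Manual affine transform: y = x W^T + b
y_manual = x @ linear.weight.T + linear.bias
y_layer = linear(x)

# Compare with a small tolerance, since float32 results can differ in the last bits
matches = torch.allclose(y_manual, y_layer, atol=1e-5)
print(matches)

# Parameter count of a single Linear: in * out weights plus out biases
n_params = linear.weight.numel() + linear.bias.numel()
print(n_params)  # 20 * 10 + 10 = 210
```

The transpose is there because the weight is stored as (out_features, in_features), so each row of the weight holds the coefficients for one output feature.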

Default initialization

By default, nn.Linear initializes the weight with Kaiming uniform (a good fit for the ReLU family) and the bias with a uniform distribution scaled by the input fan-in. You usually won't override it, but knowing it exists makes debugging a "model won't train" problem tractable.
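
When debugging, it can help to look at the freshly initialized values directly. The sketch below assumes the behavior of current PyTorch releases, where nn.Linear calls Kaiming uniform with a = sqrt(5), which works out to a bound of 1/sqrt(fan_in) for both weight and bias. Treat that bound as an assumption about the installed version and not as a guarantee.

```python
import math
import torch
import torch.nn as nn

fan_in, fan_out = 20, 10
linear = nn.Linear(fan_in, fan_out)

# Assumed bound for the default init in current PyTorch: 1 / sqrt(fan_in)
bound = 1.0 / math.sqrt(fan_in)

w_max = linear.weight.abs().max().item()
b_max = linear.bias.abs().max().item()
print(f"bound={bound:.4f}  w_max={w_max:.4f}  b_max={b_max:.4f}")
```

If the largest absolute values sit far outside this range, something has probably re-initialized the layer, which is a useful clue when a model refuses to train.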

MLP idiom

A multi-layer perceptron is Linear → activation → Linear → activation → ... → Linear. Modern variants add Dropout for regularization, LayerNorm for stability, and skip connections for very deep networks. The pattern is universal: every Transformer FFN block is a 2-layer MLP with a non-linearity (usually GELU) in the middle.
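
The Transformer-style variant described above can be sketched as follows. The class name FFNBlock, the pre-norm ordering (LayerNorm before the MLP), and the 4x hidden width are illustrative assumptions, not something the text prescribes.

```python
import torch
import torch.nn as nn

class FFNBlock(nn.Module):
    """Hypothetical pre-norm feed-forward block: x + Dropout(MLP(LayerNorm(x)))."""

    def __init__(self, d_model, d_hidden, p_drop=0.1):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)        # stability
        self.fc1 = nn.Linear(d_model, d_hidden)  # expand
        self.act = nn.GELU()                     # non-linearity in the middle
        self.fc2 = nn.Linear(d_hidden, d_model)  # project back
        self.drop = nn.Dropout(p_drop)           # regularization

    def forward(self, x):
        h = self.fc2(self.act(self.fc1(self.norm(x))))
        return x + self.drop(h)                  # skip connection

block = FFNBlock(d_model=64, d_hidden=256)       # 4x width is an assumed convention
x = torch.randn(8, 16, 64)                       # (batch, sequence, features)
y = block(x)
print(y.shape)  # torch.Size([8, 16, 64])
```

Because nn.Linear acts on the last dimension only, the same block works on (batch, features) and (batch, sequence, features) inputs. The skip connection requires the output width to match the input width, which is why fc2 projects back to d_model.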

Code

nn.Linear basics·python

import torch
import torch.nn as nn

# 20 inputs → 10 outputs
linear = nn.Linear(20, 10)
print(linear.weight.shape)   # torch.Size([10, 20])  — out_features × in_features
print(linear.bias.shape)     # torch.Size([10])

x = torch.randn(32, 20)      # batch=32, features=20
y = linear(x)
print(y.shape)               # torch.Size([32, 10])

# Without bias (rare but useful in some Transformer variants)
linear_nb = nn.Linear(20, 10, bias=False)
print(linear_nb.bias)        # None

Building an MLP: the canonical pattern·python

import torch
import torch.nn as nn

class MLP(nn.Module):
    def __init__(self, in_dim, hidden, out_dim, p_drop=0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden),
            nn.GELU(),                # transformer-flavor; ReLU works too
            nn.Dropout(p_drop),
            nn.Linear(hidden, hidden),
            nn.GELU(),
            nn.Dropout(p_drop),
            nn.Linear(hidden, out_dim),
        )

    def forward(self, x):
        return self.net(x)

model = MLP(784, 256, 10)
print(f"Params: {sum(p.numel() for p in model.parameters()):,}")
# Params: 269,322
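
The 269,322 figure can be reproduced by hand: each Linear contributes in_features × out_features weights plus out_features biases, while GELU and Dropout have no parameters. A small sketch in plain Python, where linear_params is a hypothetical helper written for this illustration:

```python
def linear_params(in_features, out_features, bias=True):
    """Parameter count of one Linear layer: weights plus optional biases."""
    return in_features * out_features + (out_features if bias else 0)

# The three Linear layers of MLP(784, 256, 10)
layers = [(784, 256), (256, 256), (256, 10)]
counts = [linear_params(i, o) for i, o in layers]

print(counts)       # [200960, 65792, 2570]
print(sum(counts))  # 269322
```

The first layer dominates the count because it carries the 784-wide input, which is typical for MLPs on flattened inputs.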

Custom init: when the default isn't what you want·python

import torch
import torch.nn as nn

class MLP(nn.Module):
    def __init__(self, in_dim, hidden, out_dim):
        super().__init__()
        self.fc1 = nn.Linear(in_dim, hidden)
        self.fc2 = nn.Linear(hidden, out_dim)
        self.act = nn.GELU()
        self._init_weights()

    def _init_weights(self):
        # Custom Xavier init — useful when you want a specific behavior
        for m in self.modules():
            if isinstance(m, nn.Linear):
                nn.init.xavier_normal_(m.weight)
                if m.bias is not None:
                    nn.init.zeros_(m.bias)

    def forward(self, x):
        return self.fc2(self.act(self.fc1(x)))
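
To confirm that a custom init actually took effect, one option is to compare the weight statistics against what Xavier normal should produce, a standard deviation of sqrt(2 / (fan_in + fan_out)) at the default gain of 1. The sketch below uses a standalone function with nn.Module.apply in place of the _init_weights method; init_xavier and the layer sizes are hypothetical names and values chosen for illustration.

```python
import math
import torch
import torch.nn as nn

def init_xavier(m):
    # Same rule as the method above: Xavier normal weights, zero biases
    if isinstance(m, nn.Linear):
        nn.init.xavier_normal_(m.weight)
        if m.bias is not None:
            nn.init.zeros_(m.bias)

net = nn.Sequential(nn.Linear(784, 256), nn.GELU(), nn.Linear(256, 10))
net.apply(init_xavier)  # apply() visits every submodule recursively

fc1 = net[0]
expected_std = math.sqrt(2.0 / (784 + 256))
actual_std = fc1.weight.std().item()
bias_is_zero = bool(torch.all(fc1.bias == 0))
print(f"expected={expected_std:.4f}  actual={actual_std:.4f}  bias_zero={bias_is_zero}")
```

The measured standard deviation is a sample estimate, so it will only match the expected value approximately.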
