Transformer Block

A transformer is a stack of nearly identical blocks

Every transformer block has two sublayers: multi-head self-attention and a position-wise feedforward network (a small MLP applied independently at each position). Each sublayer is wrapped in a residual connection and layer normalization.

That's it. Stack 12 blocks and you have a small transformer (BERT-base); stack 96 and you are at frontier-LLM scale. Add an embedding layer at the input, an LM head or classifier at the output, and a position encoding so the model knows the token order.
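To make "stack 12 blocks" concrete, here is a rough parameter count for the blocks alone. It is a sketch under assumptions not stated in the text: a hidden size of 768 and a feedforward size of 3072 (the sizes BERT-base is published with), biases on every linear layer, and embeddings and output head left out. The function name `block_params` is illustrative.

```python
def block_params(d_model: int, d_ff: int) -> int:
    """Learnable parameters in one block (weights and biases), under the assumptions above."""
    # Q, K, V and output projections: four d_model x d_model matrices plus biases.
    attention = 4 * (d_model * d_model + d_model)
    # Two linear layers: d_model -> d_ff and d_ff -> d_model, each with a bias.
    ffn = (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)
    # Two layer norms, each with a scale and a shift vector.
    layer_norms = 2 * (2 * d_model)
    return attention + ffn + layer_norms

per_block = block_params(d_model=768, d_ff=3072)
print(per_block)        # 7087872
print(12 * per_block)   # 85054464
```

The number of heads does not appear in the count: the heads split the same projection matrices rather than adding new ones. Roughly two thirds of each block's parameters sit in the feedforward network.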

Tip: If you can draw a transformer block from memory (residual + LN, attention, residual + LN, FFN), you can read almost any modern LLM paper. The architectural variations (RMSNorm, rotary position embeddings, GLU activations, mixture of experts) are all swap-outs of one of the basic pieces.

Position encoding

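Self-attention on its own has no notion of order: permute the input tokens and the outputs are permuted the same way (strictly speaking, it is permutation-equivariant). To give the model a sense of order, positional information is added to the token embeddings. The original transformer used sinusoidal encodings; many modern LLMs use RoPE (rotary position embedding), which builds position into the query/key projections themselves, so that the attention score between two tokens depends on their relative position rather than on where they sit in the context.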
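The relative-position property of RoPE can be checked in a few lines of plain Python. This is an illustrative sketch, not any library's implementation: the helper names `rotate` and `score` are made up here, the base of 10000 is the commonly used value, and pairing consecutive features is only one of the conventions found in real code.

```python
import math

def rotate(vec, position, base=10000.0):
    """Rotate each consecutive pair of features by an angle proportional to position."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)   # one rotation frequency per feature pair
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out

def score(q, k, q_pos, k_pos):
    """Attention logit between a rotated query and a rotated key."""
    rq, rk = rotate(q, q_pos), rotate(k, k_pos)
    return sum(a * b for a, b in zip(rq, rk))

q = [0.3, -1.2, 0.8, 0.5]
k = [1.0, 0.4, -0.7, 0.2]

# Same offset (two positions apart), different absolute positions.
print(score(q, k, 3, 1))
print(score(q, k, 10, 8))
```

The two printed logits agree up to floating-point error, because rotating the query by its position and the key by its position leaves a dot product that depends only on the difference between the two positions.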

Three flavors of attention you will see

Encoder (BERT, ViT): bidirectional self-attention; every token attends to every other token. Used for classification, retrieval, and embeddings.
Decoder (GPT, LLaMA): causal self-attention with a triangular mask; each token attends only to itself and to earlier tokens. Used for generation.
Encoder-decoder (T5, BART): both of the above, plus cross-attention from the decoder to the encoder. Used for translation, summarization, and text-to-text tasks.
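The difference between bidirectional and causal attention comes down to which scores survive the softmax. The sketch below uses plain Python and a hypothetical helper `attention_weights`; all scores are set to zero so that the effect of the triangular mask is easy to see.

```python
import math

def attention_weights(scores, causal):
    """Row-wise softmax over a square score matrix, optionally with a causal mask."""
    n = len(scores)
    weights = []
    for i in range(n):
        # Under the causal mask, position i may only look at positions 0..i.
        row = [scores[i][j] if (not causal or j <= i) else -math.inf
               for j in range(n)]
        peak = max(row)
        exps = [math.exp(s - peak) for s in row]   # exp(-inf) evaluates to 0.0
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

scores = [[0.0] * 4 for _ in range(4)]

for row in attention_weights(scores, causal=False):
    print(row)   # every row: [0.25, 0.25, 0.25, 0.25]
for row in attention_weights(scores, causal=True):
    print(row)   # first row: [1.0, 0.0, 0.0, 0.0]; last row: [0.25, 0.25, 0.25, 0.25]
```

In the causal case the first token can only attend to itself, and each later token spreads its weight over itself and everything before it. An encoder corresponds to `causal=False`, a decoder to `causal=True`.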

Principle: Most modern transformers are variations on one block. Learn that block in depth (embedding + positional encoding + N×(LayerNorm + Attention + LayerNorm + FFN, with residuals) + final norm + head) and you can read almost any model card on Hugging Face.
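The recipe in parentheses can be written out as a small decoder-style model. This is a minimal sketch, not a reference implementation: the class name `TinyDecoder` and all sizes are arbitrary, learned position embeddings stand in for sinusoidal or rotary ones to keep it short, and PyTorch's built-in `nn.TransformerEncoderLayer` with `norm_first=True` is used as the pre-norm block.

```python
import torch
import torch.nn as nn

class TinyDecoder(nn.Module):
    def __init__(self, vocab_size=1000, max_len=128, d_model=64, n_heads=4,
                 d_ff=256, n_layers=2):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)   # token embedding
        self.pos_emb = nn.Embedding(max_len, d_model)      # learned positions (assumption)
        self.blocks = nn.ModuleList([
            nn.TransformerEncoderLayer(d_model, n_heads, dim_feedforward=d_ff,
                                       dropout=0.0, activation="gelu",
                                       batch_first=True, norm_first=True)
            for _ in range(n_layers)
        ])
        self.final_norm = nn.LayerNorm(d_model)
        self.head = nn.Linear(d_model, vocab_size, bias=False)   # LM head

    def forward(self, token_ids):
        seq_len = token_ids.shape[1]
        positions = torch.arange(seq_len, device=token_ids.device)
        x = self.tok_emb(token_ids) + self.pos_emb(positions)
        # True above the diagonal marks future positions that must not be attended to.
        causal = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool,
                                       device=token_ids.device), diagonal=1)
        for block in self.blocks:
            x = block(x, src_mask=causal)
        return self.head(self.final_norm(x))

model = TinyDecoder()
logits = model(torch.randint(0, 1000, (2, 16)))
print(logits.shape)   # torch.Size([2, 16, 1000])
```

Dropping the causal mask and replacing the LM head with a classifier turns the same skeleton into an encoder, which is the sense in which the architectures are variations on one block.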

Code

A pre-norm transformer block (Python)

import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    def __init__(self, d_model=512, n_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.attn  = nn.MultiheadAttention(d_model, n_heads, dropout=dropout,
                                           batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        self.ffn   = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(d_ff, d_model),
        )
        self.drop = nn.Dropout(dropout)

    def forward(self, x, attn_mask=None):
        # Pre-norm: norm before each sublayer, residual after
        h = self.norm1(x)
        x = x + self.drop(self.attn(h, h, h, attn_mask=attn_mask, need_weights=False)[0])
        h = self.norm2(x)
        x = x + self.drop(self.ffn(h))
        return x
