The full block: putting it all together

~12 min · block, synthesis

You now have every part. A complete modern decoder-only Transformer block looks like this:

x ← x + MultiHeadAttention(RMSNorm(x))
x ← x + SwiGLU_FFN(RMSNorm(x))

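That's all of it. Two sublayers, each one applied to a normalized input and then added back onto the residual stream. Stack N copies and you have the full Transformer body. The output of the last block passes through a final RMSNorm and then the LM head (a projection to vocabulary size, often weight-tied to the input embedding).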
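To make "stack N copies, then final norm and LM head" concrete, here is a rough parameter-count sketch. It assumes bias-free linear layers, a SwiGLU FFN with three weight matrices, and RMSNorm with a single gain vector. `ModelConfig` and `count_params` are hypothetical names introduced only for this illustration.

```python
from dataclasses import dataclass


@dataclass
class ModelConfig:
    # Hypothetical config; fields mirror the constructor arguments used in this lesson.
    vocab_size: int
    d_model: int
    n_layers: int
    n_q_heads: int
    n_kv_heads: int
    d_ff: int
    tie_embeddings: bool = True


def count_params(cfg: ModelConfig) -> dict:
    head_dim = cfg.d_model // cfg.n_q_heads
    kv_dim = cfg.n_kv_heads * head_dim
    # Attention: W_q and W_o are d_model x d_model; W_k and W_v shrink with n_kv_heads.
    attn = 2 * cfg.d_model * cfg.d_model + 2 * cfg.d_model * kv_dim
    # SwiGLU FFN: gate, up, and down projections (assumed bias-free).
    ffn = 3 * cfg.d_model * cfg.d_ff
    # Two RMSNorm layers per block, one gain vector each.
    norms = 2 * cfg.d_model
    per_block = attn + ffn + norms
    embedding = cfg.vocab_size * cfg.d_model
    # A weight-tied LM head adds no parameters of its own.
    lm_head = 0 if cfg.tie_embeddings else embedding
    # Body = n_layers blocks, plus the final RMSNorm gain vector.
    total = embedding + cfg.n_layers * per_block + cfg.d_model + lm_head
    return {"per_block": per_block, "embedding": embedding,
            "lm_head": lm_head, "total": total}


toy = ModelConfig(vocab_size=1000, d_model=64, n_layers=2,
                  n_q_heads=8, n_kv_heads=2, d_ff=256)
print(count_params(toy))
```

With these toy values each block holds 59,520 parameters and the tied model 183,104 in total. Setting `tie_embeddings=False` adds one more vocab_size × d_model matrix for the LM head.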

What changes between models

The block is remarkably stable across the major model families. What varies:

Attention type: dense MHA / GQA / MQA / sliding-window / sparse.
Position scheme: RoPE / ALiBi / iRoPE / Sandwich / YaRN-scaled.
Activation: GELU / SwiGLU / GeGLU.
FFN: standard / MoE, which replaces the FFN with a router plus many experts.
Normalization: LayerNorm / RMSNorm / DeepNorm (rare).
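For the attention axis, the MHA / GQA / MQA choice largely comes down to one integer, the number of key/value heads. The sketch below shows how that integer sets the KV-cache size per token. The function name, the 32-layer / 32-query-head / 128-dim configuration, and the 2 bytes per value (fp16 or bf16) are illustrative assumptions, not figures from any specific model card.

```python
def kv_cache_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    # Each layer caches one K and one V vector per KV head for every token.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value


# Hypothetical model: 32 layers, 32 query heads, head_dim 128.
n_q_heads = 32
variants = {
    "MHA": n_q_heads,  # one KV head per query head
    "GQA": 8,          # groups of 4 query heads share one KV head
    "MQA": 1,          # all query heads share a single KV head
}

for name, n_kv_heads in variants.items():
    size = kv_cache_bytes_per_token(n_layers=32, n_kv_heads=n_kv_heads, head_dim=128)
    print(f"{name}: {size} bytes per token")
```

Under these assumptions the cache shrinks from 524,288 bytes per token (MHA) to 131,072 (GQA) to 16,384 (MQA), while the rest of the block stays the same.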

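If you can read the two-line diagram above, you can read practically any modern model card. You have now internalized the roughly 95% of the architecture that does not change.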

Code

A Llama-style block, end to end·python

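import torch.nn as nn

# RMSNorm, GroupedQueryAttention, and SwiGLU are assumed to be the modules
# from the earlier lessons; they are not redefined here.

class LlamaBlock(nn.Module):
    """Pre-norm decoder block: attention sublayer, then FFN sublayer."""
    def __init__(self, d_model, n_q_heads, n_kv_heads, d_ff):
        super().__init__()
        self.norm1 = RMSNorm(d_model)
        self.attn = GroupedQueryAttention(d_model, n_q_heads, n_kv_heads)
        self.norm2 = RMSNorm(d_model)
        self.ffn = SwiGLU(d_model, d_ff)

    def forward(self, x, rope_cache=None, mask=None):
        # x <- x + MultiHeadAttention(RMSNorm(x))
        x = x + self.attn(self.norm1(x), rope_cache=rope_cache, mask=mask)
        # x <- x + SwiGLU_FFN(RMSNorm(x))
        x = x + self.ffn(self.norm2(x))
        return x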

# Stack:
class LlamaModel(nn.Module):
    def __init__(self, vocab_size, d_model, n_layers, n_q_heads, n_kv_heads, d_ff):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.blocks = nn.ModuleList([
            LlamaBlock(d_model, n_q_heads, n_kv_heads, d_ff)
            for _ in range(n_layers)
        ])
        self.final_norm = RMSNorm(d_model)
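        # Weight tying: the LM head reuses tok_emb.weight, so no separate nn.Linear is created.

    def forward(self, ids, rope_cache=None, mask=None):
        x = self.tok_emb(ids)                    # (batch, seq) -> (batch, seq, d_model)
        for block in self.blocks:
            # Pass the RoPE cache and the causal mask through to every block.
            x = block(x, rope_cache=rope_cache, mask=mask)
        x = self.final_norm(x)
        logits = x @ self.tok_emb.weight.T       # (batch, seq, vocab_size)
        return logits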
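The listing relies on `RMSNorm`, `SwiGLU`, and `GroupedQueryAttention` being defined already. If you want to run it in isolation, the following minimal stand-ins should be enough. They are a sketch, not the versions from the earlier lessons: biases are omitted, `rope_cache` is accepted but ignored (no positional encoding is applied), and attention falls back to a causal mask when `mask` is None. It assumes PyTorch 2.0 or later for `F.scaled_dot_product_attention`.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class RMSNorm(nn.Module):
    def __init__(self, d_model, eps=1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(d_model))

    def forward(self, x):
        # Divide by the root-mean-square over the feature dimension, then rescale.
        inv_rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x * inv_rms * self.weight


class SwiGLU(nn.Module):
    def __init__(self, d_model, d_ff):
        super().__init__()
        self.gate = nn.Linear(d_model, d_ff, bias=False)
        self.up = nn.Linear(d_model, d_ff, bias=False)
        self.down = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x):
        # SiLU-gated linear unit followed by the down projection.
        return self.down(F.silu(self.gate(x)) * self.up(x))


class GroupedQueryAttention(nn.Module):
    def __init__(self, d_model, n_q_heads, n_kv_heads):
        super().__init__()
        assert d_model % n_q_heads == 0 and n_q_heads % n_kv_heads == 0
        self.n_q, self.n_kv = n_q_heads, n_kv_heads
        self.head_dim = d_model // n_q_heads
        self.wq = nn.Linear(d_model, n_q_heads * self.head_dim, bias=False)
        self.wk = nn.Linear(d_model, n_kv_heads * self.head_dim, bias=False)
        self.wv = nn.Linear(d_model, n_kv_heads * self.head_dim, bias=False)
        self.wo = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x, rope_cache=None, mask=None):
        # rope_cache is ignored in this stand-in (assumption: no positional encoding).
        B, T, _ = x.shape
        q = self.wq(x).view(B, T, self.n_q, self.head_dim).transpose(1, 2)
        k = self.wk(x).view(B, T, self.n_kv, self.head_dim).transpose(1, 2)
        v = self.wv(x).view(B, T, self.n_kv, self.head_dim).transpose(1, 2)
        # Repeat each K/V head so every query head in a group shares it.
        rep = self.n_q // self.n_kv
        k = k.repeat_interleave(rep, dim=1)
        v = v.repeat_interleave(rep, dim=1)
        if mask is None:
            out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        else:
            # Boolean mask: True means "may attend"; a float mask is added to the scores.
            out = F.scaled_dot_product_attention(q, k, v, attn_mask=mask)
        out = out.transpose(1, 2).reshape(B, T, self.n_q * self.head_dim)
        return self.wo(out)


# The two-line block update, written out by hand on random input.
torch.manual_seed(0)
x = torch.randn(2, 5, 64)
attn = GroupedQueryAttention(64, n_q_heads=8, n_kv_heads=2)
y = x + attn(RMSNorm(64)(x))
y = y + SwiGLU(64, 256)(RMSNorm(64)(y))
print(y.shape)  # torch.Size([2, 5, 64])
```

With these three classes in scope, the `LlamaBlock` and `LlamaModel` listing above can be instantiated as written. Swap in the lesson versions once you want RoPE applied.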

Exercise

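Read karpathy/nanoGPT/model.py all the way through. Identify every component covered in this quest (tokenizer → embedding → causal mask → multi-head attention → residual → norm → MLP → LM head). Annotate each section with a comment linking it to the lesson that covered it. Submit the annotated copy.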
