The input pipeline: putting it all together

You now have every piece of the input pipeline. Let's assemble them.

1. Tokenization. Raw text → token IDs. The tokenizer is fixed at training time, so you have to use the model's own tokenizer, unchanged.
2. Embedding lookup. Token IDs → dense vectors via emb.weight[ids]. Output shape: (seq_len, d_model).
3. Position injection. One of:
   - Add sinusoidal PE to the embeddings (the original Transformer).
   - Add learned positional embeddings to the embeddings (BERT, GPT-2).
   - Apply RoPE to Q and K inside attention; the input-stage embeddings stay untouched (Llama, Mistral, Qwen).
   - Apply an ALiBi bias to the attention scores; the input-stage embeddings stay untouched (BLOOM, MPT).
4. Layer norm / dropout (optional). Some models normalize the input embeddings and apply dropout before the first block.
5. The result goes into the first Transformer block.
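The stages above can be sketched end to end in a few lines of PyTorch. This is a minimal sketch, not any particular model's code: the whitespace `toy_tokenize` and its four-word `vocab` are hypothetical stand-ins for a real tokenizer, and `d_model = 16`, `max_len = 32`, and the dropout rate are arbitrary illustration values.

```python
import math
import torch
import torch.nn as nn

def sinusoidal_pe(seq_len, d_model):
    """Fixed sinusoidal position table, shape (seq_len, d_model). Assumes d_model is even."""
    pos = torch.arange(seq_len, dtype=torch.float32)[:, None]            # (seq_len, 1)
    div = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float32)
                    * (-math.log(10000.0) / d_model))                    # (d_model/2,)
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(pos * div)   # even feature indices
    pe[:, 1::2] = torch.cos(pos * div)   # odd feature indices
    return pe

# Hypothetical toy tokenizer: a real model ships its own tokenizer and vocabulary.
vocab = {"<unk>": 0, "the": 1, "cat": 2, "sat": 3}

def toy_tokenize(text):
    return torch.tensor([vocab.get(w, 0) for w in text.lower().split()])

d_model, max_len = 16, 32                      # illustration values only
emb = nn.Embedding(len(vocab), d_model)
learned_pos = nn.Embedding(max_len, d_model)   # one trainable vector per position

ids = toy_tokenize("The cat sat")              # tokenization: text -> token IDs
x = emb(ids)                                   # embedding lookup, same as emb.weight[ids]

x_sin = x + sinusoidal_pe(len(ids), d_model)             # additive sinusoidal PE
x_learned = x + learned_pos(torch.arange(len(ids)))      # additive learned positions
x_rope = x                                     # RoPE / ALiBi: input embeddings untouched

# Optional normalization and dropout before the first block
norm, drop = nn.LayerNorm(d_model), nn.Dropout(0.1)
out = drop(norm(x_learned))                    # (seq_len, d_model)
```

Note that every variant produces the same `(seq_len, d_model)` shape, so the first block cannot tell from the shape alone which scheme was used.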

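The key conceptual point: position is information, and the architecture chooses where to inject it. There is no neutral choice. RoPE currently dominates, but the rest of the architecture doesn't depend on which scheme you pick: in any modern decoder-only Transformer you could swap RoPE for ALiBi by changing a few hundred lines of code (the model would then need retraining, since its weights were learned under the old scheme).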
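To make "where to inject it" concrete, here is a rough sketch of the ALiBi option, where position never touches the input embeddings and instead shows up as a bias on the attention scores. The function name `alibi_bias` is made up for this example, and the slope schedule is the geometric one commonly used when the head count is a power of two; treat both as assumptions rather than any specific model's code.

```python
import torch

def alibi_bias(n_heads, seq_len):
    """Per-head linear distance penalty, shape (n_heads, seq_len, seq_len).

    Assumes n_heads is a power of two, so head h gets slope 2^(-8(h+1)/n_heads).
    """
    slopes = torch.tensor([2.0 ** (-8.0 * (h + 1) / n_heads) for h in range(n_heads)])
    pos = torch.arange(seq_len)
    # distance[i, j] = i - j for keys at or before the query, 0 for future keys
    # (future keys are removed by the causal mask anyway)
    distance = (pos[:, None] - pos[None, :]).clamp(min=0)
    return -slopes[:, None, None] * distance

# Usage: add the bias to raw attention scores before the softmax.
scores = torch.zeros(4, 5, 5)              # placeholder (n_heads, L, L) scores
biased = scores + alibi_bias(4, 5)
```

The further back a key sits, the more its score is pushed down, and each head uses a different slope. The embeddings going into the first block are exactly what the embedding lookup produced.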

Code

End-to-end input pipeline (RoPE-style)·python

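import math
import torch
import torch.nn as nn

class TransformerInput(nn.Module):
    def __init__(self, vocab_size, d_model):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, d_model)
        # No learned positional embedding: RoPE is applied later, inside attention
    def forward(self, ids):
        return self.emb(ids)             # (B, L, d_model)

# Inside the attention layer.
# `rope` is not defined in this snippet: it stands for a rotary-embedding helper
# that rotates each head's features by an angle set by the token's position.
def attention_with_rope(x, W_q, W_k, W_v, n_heads):
    B, L, d = x.shape
    d_head = d // n_heads
    Q = (x @ W_q).view(B, L, n_heads, d_head)
    K = (x @ W_k).view(B, L, n_heads, d_head)
    V = (x @ W_v).view(B, L, n_heads, d_head)   # V is never rotated
    Q = rope(Q)                          # rotate by position
    K = rope(K)                          # rotate by position
    # Standard scaled dot-product attention from here (masking omitted for brevity)
    Q, K, V = (t.transpose(1, 2) for t in (Q, K, V))        # (B, n_heads, L, d_head)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_head)    # (B, n_heads, L, L)
    out = scores.softmax(dim=-1) @ V                        # (B, n_heads, L, d_head)
    return out.transpose(1, 2).reshape(B, L, d)             # (B, L, d_model)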
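
The snippet above calls `rope` without defining it. Here is one possible definition, as a sketch under a few assumptions: it expects the `(B, L, n_heads, d_head)` layout used above, an even `d_head`, the "rotate-half" pairing (feature i is paired with feature i + d_head/2; other implementations pair adjacent features instead), and a default `base` of 10000.

```python
import torch

def rope(x, base=10000.0):
    """Rotate features by position. x: (B, L, n_heads, d_head), d_head even."""
    L, d_head = x.shape[1], x.shape[-1]
    half = d_head // 2
    # One rotation frequency per feature pair, from fast (index 0) to slow
    inv_freq = base ** (-torch.arange(half, dtype=torch.float32) / half)         # (half,)
    angles = torch.arange(L, dtype=torch.float32)[:, None] * inv_freq[None, :]   # (L, half)
    cos = angles.cos()[None, :, None, :]    # broadcast over batch and heads
    sin = angles.sin()[None, :, None, :]
    x1, x2 = x[..., :half], x[..., half:]
    # 2-D rotation of each (x1_i, x2_i) pair by angle position * inv_freq_i
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

# Sanity check: with identical content at every position, the Q·K score
# depends only on the offset between the two positions.
torch.manual_seed(0)
q = torch.randn(1, 1, 2, 8).expand(1, 6, 2, 8)
k = torch.randn(1, 1, 2, 8).expand(1, 6, 2, 8)
scores = torch.einsum("blhd,bmhd->bhlm", rope(q), rope(k))   # (B, n_heads, L, L)
same_offset = torch.allclose(scores[0, 0, 2, 1], scores[0, 0, 5, 4], atol=1e-4)
```

The rotation leaves position 0 unchanged and never changes a vector's length, which is why it can be applied to Q and K without any extra normalization.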

Exercise

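In a Jupyter notebook, write a function of about 50 lines that takes a string, a tokenizer, and an embedding matrix, and returns the (seq_len, d_model) tensor that would enter the first attention block. Make the position scheme a function argument so that sinusoidal, learned, and a RoPE-style placeholder (identity at the input stage) can be swapped in and out.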
