Decoder-only: from GPT to Llama, the universal shape

The decoder-only paradigm (causal self-attention plus next-token prediction) is the universal shape of modern language AI. As of 2026, virtually every conversational AI you can name uses this architecture.
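
The two ingredients named above can be sketched without any framework. This is a minimal pure-Python illustration, not production code; the function names `causal_mask` and `next_token_pairs` are hypothetical and chosen only for this example.

```python
def causal_mask(seq_len):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]


def next_token_pairs(token_ids):
    """Shift by one: the model reads ids[:-1] and is trained to predict ids[1:]."""
    return token_ids[:-1], token_ids[1:]


mask = causal_mask(4)
inputs, targets = next_token_pairs([11, 22, 33, 44, 55])

for row in mask:
    # Each row shows which earlier positions a token is allowed to see.
    print(["x" if allowed else "." for allowed in row])
print(inputs, targets)
```

The lower-triangular mask is what makes the attention "causal": a position never sees tokens to its right, so every position can be trained on its own next-token target in a single pass.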

Lineage highlights

GPT-1 (2018). 117M params, 12 layers. Showed that unsupervised pretraining followed by supervised fine-tuning works across NLP tasks.
GPT-2 (2019). 1.5B params, 48 layers. Demonstrated coherent multi-paragraph generation; OpenAI initially withheld the full weights over misuse concerns.
GPT-3 (2020). 175B params, 96 layers, d_model=12,288. Scale unlocked in-context learning: solving new tasks from examples in the prompt, with no fine-tuning.
Llama 1 (2023). Meta's first open release. Variants from 7B to 65B. Widely regarded as the first open-weight model family to compete with the proprietary frontier.
Llama 3 (2024). 8B and 70B at launch. 128K vocab, GQA, RoPE, SwiGLU: defined the modern open-weight architecture.
Llama 4 (2025). Scout (109B total / 17B active) and Maverick (400B total / 17B active). Mixture-of-Experts, iRoPE, multimodal. Pushes context toward 10M tokens.
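
The GPT-3 numbers in the list can be sanity-checked with a common back-of-the-envelope rule: roughly 12 · n_layers · d_model² parameters in the transformer blocks. The sketch below assumes four d_model × d_model attention projections and an FFN with hidden size 4 · d_model, and it ignores embeddings, biases, and norm parameters. Those assumptions and the helper name `approx_block_params` are not from the text above.

```python
def approx_block_params(n_layers, d_model, ffn_mult=4):
    """Rough parameter count for the stacked blocks only (no embeddings)."""
    attn = 4 * d_model * d_model            # W_q, W_k, W_v, W_o
    ffn = 2 * ffn_mult * d_model * d_model  # up projection + down projection
    return n_layers * (attn + ffn)


# GPT-3 figures from the list: 96 layers, d_model = 12,288.
gpt3 = approx_block_params(n_layers=96, d_model=12_288)
print(f"{gpt3 / 1e9:.1f}B parameters in the blocks")
```

The estimate lands within about one percent of the quoted 175B, which suggests that nearly all of the parameters sit in the repeated block rather than anywhere else in the model.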

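Architectural drift from GPT-1 to Llama 4 is small. The unit cell (norm + multi-head attention + residual, then norm + FFN + residual) has stayed essentially the same. One caveat: GPT-1 applied LayerNorm after each sublayer, and the pre-norm placement became standard from GPT-2 on. Later refinements (RMSNorm, GQA, RoPE, SwiGLU, MoE FFNs) swap parts inside the cell without changing its wiring. What scaled was the dimensions, the data, and the post-training stack.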
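
The wiring of that unit cell fits in a few lines. The following is a toy sketch on plain floats with hypothetical stand-in callables (`norm1`, `attn`, `norm2`, `ffn`), not any model's real code. The post-norm variant is included only for contrast.

```python
def pre_norm_block(x, norm1, attn, norm2, ffn):
    """Pre-norm: normalize each sublayer's input; the residual path is untouched."""
    x = x + attn(norm1(x))
    x = x + ffn(norm2(x))
    return x


def post_norm_block(x, norm1, attn, norm2, ffn):
    """Post-norm: normalize after the residual add, so the skip path is rescaled too."""
    x = norm1(x + attn(x))
    x = norm2(x + ffn(x))
    return x


# Toy stand-ins on scalars (hypothetical, for illustration only).
def halve(h):
    return 0.5 * h   # plays the role of a norm


def zero(h):
    return 0.0       # a sublayer that contributes nothing


pre_out = pre_norm_block(8.0, halve, zero, halve, zero)
post_out = post_norm_block(8.0, halve, zero, halve, zero)
print(pre_out, post_out)
```

With sublayers that contribute nothing, the pre-norm block returns its input unchanged, while the post-norm block rescales it twice. That untouched identity path is one commonly cited reason deep pre-norm stacks are easier to train.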

Code

Decoder-only is just a stack of one block type (Python)

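import torch.nn as nn

# LlamaBlock and RMSNorm are assumed to be defined elsewhere.

class DecoderOnly(nn.Module):
    def __init__(self, vocab, d_model, n_layers, n_q_heads, n_kv_heads, d_ff):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab, d_model)
        # No positional table here: position is assumed to enter via RoPE inside each block.
        self.blocks = nn.ModuleList([
            LlamaBlock(d_model, n_q_heads, n_kv_heads, d_ff)
            for _ in range(n_layers)
        ])
        self.final_norm = RMSNorm(d_model)

    def forward(self, ids):
        x = self.tok_emb(ids)              # (batch, seq) -> (batch, seq, d_model)
        for blk in self.blocks:
            x = blk(x)
        x = self.final_norm(x)
        # Weight-tied LM head: reuse the embedding matrix to produce logits.
        # (Some models use a separate, untied output projection instead.)
        return x @ self.tok_emb.weight.T   # (batch, seq, vocab)

# That's the entire architecture. Llama 3, Mistral, Qwen, Phi:
# each is essentially this stack with different (vocab, d_model, n_layers, ...).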
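
The class above relies on `LlamaBlock` and `RMSNorm`, which this page does not define. One possible sketch of both, matching the constructor signatures used above, is shown below. It is an assumption-laden illustration rather than Meta's implementation: RoPE is omitted for brevity, GQA is done by repeating K/V heads, and all attribute names (`wq`, `w_gate`, and so on) are hypothetical. It assumes PyTorch 2.0 or later for `F.scaled_dot_product_attention`.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class RMSNorm(nn.Module):
    """Rescale by the root mean square; no mean subtraction, no bias."""
    def __init__(self, d_model, eps=1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(d_model))

    def forward(self, x):
        inv_rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x * inv_rms * self.weight


class LlamaBlock(nn.Module):
    """Pre-norm block: x + attn(norm(x)), then x + ffn(norm(x)). RoPE omitted."""
    def __init__(self, d_model, n_q_heads, n_kv_heads, d_ff):
        super().__init__()
        assert d_model % n_q_heads == 0 and n_q_heads % n_kv_heads == 0
        self.n_q, self.n_kv = n_q_heads, n_kv_heads
        self.d_head = d_model // n_q_heads
        self.attn_norm = RMSNorm(d_model)
        self.wq = nn.Linear(d_model, n_q_heads * self.d_head, bias=False)
        self.wk = nn.Linear(d_model, n_kv_heads * self.d_head, bias=False)
        self.wv = nn.Linear(d_model, n_kv_heads * self.d_head, bias=False)
        self.wo = nn.Linear(n_q_heads * self.d_head, d_model, bias=False)
        self.ffn_norm = RMSNorm(d_model)
        self.w_gate = nn.Linear(d_model, d_ff, bias=False)
        self.w_up = nn.Linear(d_model, d_ff, bias=False)
        self.w_down = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x):
        b, t, _ = x.shape
        h = self.attn_norm(x)
        q = self.wq(h).view(b, t, self.n_q, self.d_head).transpose(1, 2)
        k = self.wk(h).view(b, t, self.n_kv, self.d_head).transpose(1, 2)
        v = self.wv(h).view(b, t, self.n_kv, self.d_head).transpose(1, 2)
        # GQA: each group of query heads shares one K/V head.
        rep = self.n_q // self.n_kv
        k = k.repeat_interleave(rep, dim=1)
        v = v.repeat_interleave(rep, dim=1)
        a = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        a = a.transpose(1, 2).reshape(b, t, self.n_q * self.d_head)
        x = x + self.wo(a)                    # residual around attention
        h = self.ffn_norm(x)
        # SwiGLU feed-forward, with a residual around it.
        return x + self.w_down(F.silu(self.w_gate(h)) * self.w_up(h))


torch.manual_seed(0)
blk = LlamaBlock(d_model=32, n_q_heads=4, n_kv_heads=2, d_ff=64)
x = torch.randn(2, 5, 32)
y = blk(x)
print(y.shape)
```

With these two classes in scope, `DecoderOnly` can be instantiated with small settings (for example 4 query heads sharing 2 K/V heads) to check shapes before scaling anything up.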
