Encoder, Decoder, Encoder-Decoder: Three Shapes, Three Roles

The 2017 transformer was an encoder-decoder. By around 2024 the design had split cleanly into three shapes, and almost all frontier work uses just one of them.

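Shape	Attention pattern	Pre-training objective	Examples	Strengths
Encoder-only	bidirectional (every token sees every token)	masked LM (predict the hidden tokens)	BERT, RoBERTa, DeBERTa	classification, NER, retrieval embeddings
Decoder-only	causal (each token sees only itself and earlier tokens)	next-token prediction	GPT-4/5, LLaMA 4, Claude, Mistral	generation, chat, code, agents
Encoder-decoder	bidirectional encoder + causal decoder + cross-attention	span corruption / denoising (T5, BART); largely supervised seq2seq (Whisper, NLLB)	T5, BART, Whisper, NLLB	translation, summarization, ASR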
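The three objectives in the table are easier to compare when you see what each one does to the same sentence. Here's a toy sketch in plain Python. The token strings, the `[MASK]` symbol, and the `<extra_id_0>` sentinel are illustrative conventions, and the span-corruption version is simplified to a single span (real setups corrupt several spans and handle sentinels a bit differently).

```python
# Toy illustration: how each pre-training objective turns the same
# token sequence into an (input, target) pair. Not a real tokenizer.

tokens = ["the", "cat", "sat", "on", "the", "mat"]

def masked_lm(tokens, masked_positions):
    """Encoder-only: hide some tokens, predict them using context on both sides."""
    inputs = ["[MASK]" if i in masked_positions else t for i, t in enumerate(tokens)]
    targets = {i: tokens[i] for i in sorted(masked_positions)}
    return inputs, targets

def next_token(tokens):
    """Decoder-only: at every position, the target is the following token."""
    return tokens[:-1], tokens[1:]

def span_corruption(tokens, start, length):
    """Encoder-decoder: replace one span with a sentinel, decode the missing span."""
    sentinel = "<extra_id_0>"  # illustrative sentinel name
    encoder_input = tokens[:start] + [sentinel] + tokens[start + length:]
    decoder_target = [sentinel] + tokens[start:start + length]
    return encoder_input, decoder_target

mlm_in, mlm_tgt = masked_lm(tokens, {1, 4})
ntp_in, ntp_tgt = next_token(tokens)
sc_in, sc_tgt = span_corruption(tokens, start=2, length=2)

print("masked LM:      ", mlm_in, mlm_tgt)
print("next-token:     ", ntp_in, ntp_tgt)
print("span corruption:", sc_in, sc_tgt)
```

Note that only next-token prediction gets a training signal at every position of the input; masked LM learns only from the hidden positions.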

Why decoder-only won the LLM race

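Three reasons. First, generation is built in: a model trained on next-token prediction becomes a generator as soon as you sample from it, so no separate decoder pre-training is needed. Second, classification and retrieval can be recast as generation tasks if you structure the prompt well. Third, the architecture itself is simpler: no encoder/decoder split, no cross-attention sub-layer, and fewer hyperparameters to tune.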
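The first two reasons can be sketched with a toy model. Everything here is hypothetical: `BIGRAM` is a hand-written table standing in for a trained decoder-only model, and `next_token_probs`, `generate`, and `classify` are illustrative names, not any library's API. The point is just that one next-token distribution is enough for both sampling and label scoring.

```python
import random

# Hypothetical stand-in for a trained model: maps the last token to a
# distribution over the next token. A real model conditions on the full context.
BIGRAM = {
    "<bos>": {"the": 1.0},
    "the": {"cat": 0.5, "dog": 0.5},
    "cat": {"sat": 0.7, "ran": 0.3},
    "dog": {"ran": 0.8, "sat": 0.2},
    "sat": {"<eos>": 1.0},
    "ran": {"<eos>": 1.0},
}

def next_token_probs(context):
    """Return P(next token | context) as a dict."""
    return BIGRAM[context[-1]]

def generate(prompt, max_new_tokens=10, seed=0):
    """Generation is just: sample a token, append it, repeat."""
    rng = random.Random(seed)
    context = list(prompt)
    for _ in range(max_new_tokens):
        probs = next_token_probs(context)
        token = rng.choices(list(probs), weights=list(probs.values()))[0]
        if token == "<eos>":
            break
        context.append(token)
    return context

def classify(prompt, labels):
    """Classification as generation: pick the label the model rates most likely next."""
    probs = next_token_probs(prompt)
    return max(labels, key=lambda label: probs.get(label, 0.0))

sample = generate(["<bos>"])
label = classify(["<bos>", "the", "dog"], labels=["sat", "ran"])
print(sample, label)
```

With a real model the prompt for `classify` would be something like a review followed by "Sentiment:", and the labels would be candidate answer tokens; the mechanics stay the same.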

Encoder-only hasn't disappeared. Widely used dense retrieval embedding models in RAG systems, such as BGE and E5, are encoder-only. It's just that for "the model people talk to", decoder-only has become the universal answer.
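To make the retrieval use case concrete, here's a rough sketch of what happens after an encoder has produced one vector per token: pool them into a single embedding, then rank documents by cosine similarity. The per-token vectors below are made-up numbers, not real model output, and mean pooling is only one common choice (some models use the first token's vector instead).

```python
import math

def mean_pool(token_vectors, attention_mask):
    """Average the vectors of real (non-padding) tokens into one embedding."""
    kept = [v for v, m in zip(token_vectors, attention_mask) if m == 1]
    dim = len(kept[0])
    return [sum(v[d] for v in kept) / len(kept) for d in range(dim)]

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up per-token encoder outputs (2-dimensional so they stay readable).
query_tokens = [[1.0, 0.0], [0.8, 0.2], [0.0, 0.0]]  # last row is padding
query_mask = [1, 1, 0]
docs = {
    "doc_a": ([[0.9, 0.1], [1.0, 0.0]], [1, 1]),
    "doc_b": ([[0.0, 1.0], [0.1, 0.9]], [1, 1]),
}

q = mean_pool(query_tokens, query_mask)
ranked = sorted(
    docs,
    key=lambda name: cosine(q, mean_pool(*docs[name])),
    reverse=True,
)
print(q, ranked)
```

Because the encoder is bidirectional, every token vector being pooled has already seen the whole passage, which is part of why this shape suits embeddings.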

Code

Causal mask — what makes decoder-only 'decoder-only'·python

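import torch

torch.manual_seed(0)
n, d_k = 6, 8  # sequence length and head dimension (illustrative values)
# Random stand-ins for the projected queries, keys, and values
Q, K, V = torch.randn(n, d_k), torch.randn(n, d_k), torch.randn(n, d_k)

# True where we should NOT attend (future positions, strictly above the diagonal)
mask = torch.triu(torch.ones(n, n), diagonal=1).bool()
# Row i is the query position: position 0 sees only itself;
# position n-1 sees every earlier position plus itself.

scores = Q @ K.transpose(-2, -1) / (d_k ** 0.5)    # scaled dot-product scores, shape (n, n)
scores = scores.masked_fill(mask, float('-inf'))   # future positions get -inf
weights = torch.softmax(scores, dim=-1)            # -inf turns into weight 0
out = weights @ V                                  # shape (n, d_k)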
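One way to convince yourself the mask really enforces causality: change only the last position's key and value and check that the outputs at earlier positions don't move. This is a sketch under the same single-head setup as above; `causal_attention` is a helper name introduced here for illustration, and the random tensors are placeholders for real projections.

```python
import torch

def causal_attention(Q, K, V):
    """Single-head scaled dot-product attention with a causal mask."""
    n, d_k = Q.shape
    mask = torch.triu(torch.ones(n, n), diagonal=1).bool()  # True = future position
    scores = Q @ K.transpose(-2, -1) / (d_k ** 0.5)
    scores = scores.masked_fill(mask, float("-inf"))
    weights = torch.softmax(scores, dim=-1)
    return weights @ V, weights

torch.manual_seed(0)
n, d_k = 6, 8
Q, K, V = torch.randn(n, d_k), torch.randn(n, d_k), torch.randn(n, d_k)
out, weights = causal_attention(Q, K, V)

# Perturb only the last position's key and value.
K2, V2 = K.clone(), V.clone()
K2[-1] += 1.0
V2[-1] += 1.0
out2, _ = causal_attention(Q, K2, V2)

# Largest attention weight placed on any future position (should be exactly 0).
future_weight = weights.triu(diagonal=1).abs().max().item()
# Outputs at positions 0..n-2 should be unaffected by the change at position n-1.
earlier_unchanged = torch.allclose(out[:-1], out2[:-1])

print(future_weight, earlier_unchanged)
```

Drop the `masked_fill` line and the same check fails, which is exactly the difference between a bidirectional encoder and a causal decoder.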
