Why Transformers still dominate (and how to stay current)

Five reasons the Transformer has not been displaced

1. Ecosystem maturity

vLLM, TensorRT-LLM, llama.cpp, MLX, SGLang: the entire serving stack is optimized for attention-based Transformers. Switching to a non-Transformer architecture means rediscovering the inference-optimization story from scratch. That cost is enormous.

2. Scaling-law confidence

We know how Transformers scale. Chinchilla, the Llama 3 scaling work, the DeepSeek scaling notes: together they amount to a decade of empirical confidence in how Transformer parameters and compute trade off. Alternative architectures lack this characterization above 100B parameters, and nobody wants to bet a multi-million-dollar training run on a less-charted scaling curve.
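To make the parameter/compute trade-off concrete, here is a minimal sketch built on two widely quoted rules of thumb: training compute of roughly 6 * N * D FLOPs for N parameters and D tokens, and the Chinchilla result of roughly 20 training tokens per parameter at the compute-optimal point. Both constants are approximations assumed here for illustration, and the function name is hypothetical.

```python
import math

# Assumption: training compute C is approximately 6 * N * D FLOPs.
FLOPS_PER_PARAM_TOKEN = 6.0
# Assumption: Chinchilla-style compute-optimal ratio of about 20 tokens per parameter.
TOKENS_PER_PARAM = 20.0


def compute_optimal_split(compute_flops: float) -> tuple[float, float]:
    """Return (params, tokens) that spend compute_flops under the two assumptions."""
    # C = 6 * N * D and D = 20 * N  =>  C = 120 * N^2  =>  N = sqrt(C / 120)
    params = math.sqrt(compute_flops / (FLOPS_PER_PARAM_TOKEN * TOKENS_PER_PARAM))
    tokens = TOKENS_PER_PARAM * params
    return params, tokens


if __name__ == "__main__":
    n, d = compute_optimal_split(5.88e23)
    print(f"params ~ {n:.3g}, tokens ~ {d:.3g}")
```

Under these assumptions a budget of 5.88e23 FLOPs splits into about 70B parameters and 1.4T tokens, which is the Chinchilla configuration. No comparably tested rule of thumb exists for alternative architectures at that scale, which is the point of the paragraph above.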

3. Benchmark design bias

Most evaluation benchmarks are designed around tasks Transformers do well: exact recall over long passages, precise multi-step reasoning, structured QA. Alternative architectures with a bounded recurrent state may do better or worse depending on the benchmark, but the benchmark distribution itself favors attention.

4. Industry risk tolerance

When a training run costs $50M, a "promising new architecture" is not what a team picks. The proven architecture wins. This is rational: the first-mover penalty for an architectural change can mean redoing the entire training pipeline.

5. Reasoning workloads need attention's strengths

Extended chain-of-thought reasoning depends on exact recall over the sequence of thinking tokens. SSMs and recurrent variants compress past context into a bounded state, which makes that kind of reasoning harder. Somewhat unexpectedly, the reasoning paradigm has made attention's strengths more valuable, not less.
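A rough sketch of the memory side of this trade-off, assuming a hypothetical model configuration (the layer count, head counts, and state size below are illustrative defaults, not taken from any specific model). Attention keeps the keys and values of every past token, so its cache grows linearly with sequence length. A recurrent state stays the same size no matter how many thinking tokens have been produced, which is why it cannot retain all of them exactly.

```python
def kv_cache_bytes(seq_len: int, n_layers: int = 32, n_kv_heads: int = 8,
                   head_dim: int = 128, bytes_per_elem: int = 2) -> int:
    """Approximate KV-cache size for one sequence (hypothetical config)."""
    # Factor 2: one key tensor plus one value tensor per layer.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem


def recurrent_state_bytes(n_layers: int = 32, state_elems: int = 8 * 128 * 128,
                          bytes_per_elem: int = 2) -> int:
    """Approximate size of a fixed recurrent state (hypothetical config)."""
    # No seq_len argument: the state is bounded regardless of context length.
    return n_layers * state_elems * bytes_per_elem


if __name__ == "__main__":
    for seq_len in (1_000, 10_000, 100_000):
        kv_mib = kv_cache_bytes(seq_len) / 2**20
        state_mib = recurrent_state_bytes() / 2**20
        print(f"{seq_len:>7} tokens: KV cache {kv_mib:9.1f} MiB, "
              f"recurrent state {state_mib:.1f} MiB")
```

The growing cache is what attention pays for exact recall. The constant-size state is cheaper, but it must discard or blur some earlier tokens once the context outgrows it.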

How to stay current without drowning

Primary sources. Model cards, system cards, technical reports. These are the most reliable.
Survey papers. Annual or quarterly surveys synthesize the noisy paper firehose. Two or three a year are worth reading.
Curated newsletters. Sebastian Raschka's Ahead of AI, Nathan Lambert's Interconnects, the Hugging Face blog. Selective, high-signal.
Community. r/LocalLLaMA for open-weight tinkering, Papers with Code for code-linked papers.

The three-question filter

For any architectural claim: (1) Is it a backbone change, a training change, or an inference change? (2) Is there a peer-reviewed or technical-report writeup? (3) Can anyone at the open frontier reproduce it? With fewer than two "yes" answers, treat the claim with skepticism.
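The filter can be written down as a small checklist, sketched here under one interpretive assumption: question (1) counts as a "yes" when the claim can be clearly placed in one of the three categories. The class and field names are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Claim:
    # Assumption: Q1 is a "yes" if the claim is clearly a backbone,
    # training, or inference change (rather than unclassifiable).
    classifiable: bool
    # Q2: a peer-reviewed paper or technical report exists.
    has_writeup: bool
    # Q3: someone at the open frontier can reproduce the result.
    reproducible: bool


def needs_skepticism(claim: Claim) -> bool:
    """Return True when fewer than two of the three answers are 'yes'."""
    yes_count = sum([claim.classifiable, claim.has_writeup, claim.reproducible])
    return yes_count < 2


if __name__ == "__main__":
    hype_thread = Claim(classifiable=True, has_writeup=False, reproducible=False)
    print(needs_skepticism(hype_thread))
```

A claim with only one "yes", like the hypothetical hype thread above, is flagged. A claim with a technical report and an open reproduction passes even if its category is still being debated.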

What to avoid

Twitter/X hype threads, YouTube clickbait, and marketing blog posts from labs shipping a product. They are designed to spread, not to inform. Default to skepticism.