Healthy Skepticism Framework

How to read new architecture papers without getting fooled

The post-Transformer literature is full of impressive numbers that often fail to reproduce. Here is a framework for separating credible claims from paper-flavored hype:

Scope: Does this architecture beat the Transformer only in a niche, or across the board? "Beats the Transformer on long-context summarization at 7B scale" is a real and useful claim. An unqualified "beats the Transformer" usually hides a niche.

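Baseline: Is the comparison against an optimized Transformer (FlashAttention-3, GQA, a modern training recipe), or against a naive implementation nobody would actually deploy? Many "alternative beats the Transformer" results disappear once the Transformer baseline includes the optimizations a real production team would use.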

Maturity: Is the hardware/software stack mature enough for production, or are you reading a paper whose code needs a specific CUDA version, custom kernels, and three months of integration work? Architecture and deployment readiness are separate axes.

Reproducibility: Is the training recipe published and reproducible? Has a third party replicated the headline result? Published-but-not-reproduced results are common, and they are worth less than they appear.

Two myths to drop

Myth: "Mamba replaces attention everywhere." Reality: pure Mamba falls short on in-context learning (ICL) and recall. The most successful Mamba deployments (Jamba, Nemotron-H, Granite 4.0) all keep attention layers. Hybrid is the practical path.

Myth: "Efficient attention and attention-free models are the same thing." Reality: they are fundamentally different. Efficient attention (FlashAttention, sparse attention) keeps the attention mechanism and optimizes how it is computed. Attention-free models (SSMs, RWKV) replace attention entirely. Their quality-efficiency trade-offs are completely different. Do not lump them together.

The single most important habit

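Theoretical complexity (O(n) vs O(n²)) does not always translate into wall-clock speedup, because constant factors and hardware optimization matter. Always benchmark on your actual workload, not on the paper's numbers. A 5x speedup at 1M context on a hand-tuned synthetic benchmark does not help if your P95 input is 4K tokens and your serving hardware has no kernel for the new architecture.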
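To make the constant-factor point concrete, here is a minimal sketch of a toy latency model. Everything in it is an assumption for illustration: the `CostModel` class, the `speedup` helper, and all of the constants are hypothetical, not measurements from any paper or hardware. The only purpose is to show how an O(n) model with larger constants can win at 1M tokens and still lose at 4K.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CostModel:
    """Toy latency model: seconds = fixed + per_token * n + per_token_sq * n**2."""
    fixed: float                # launch / framework overhead, in seconds
    per_token: float            # linear cost per token, in seconds
    per_token_sq: float = 0.0   # quadratic cost; zero for an O(n) architecture

    def latency(self, n: int) -> float:
        return self.fixed + self.per_token * n + self.per_token_sq * n * n


def speedup(baseline: CostModel, candidate: CostModel, n: int) -> float:
    """Baseline latency divided by candidate latency at sequence length n."""
    return baseline.latency(n) / candidate.latency(n)


# Hypothetical constants, chosen only to illustrate the shape of the curves.
# The attention baseline is assumed to have well-tuned kernels (small constants);
# the linear-time alternative is assumed to run on an unoptimized code path.
attention = CostModel(fixed=0.010, per_token=2e-6, per_token_sq=2e-11)
linear_alt = CostModel(fixed=0.030, per_token=4e-6)

for n in (4_096, 1_000_000):
    print(f"n={n:>9,}  speedup={speedup(attention, linear_alt, n):.2f}x")
```

With these made-up constants the alternative is several times faster at 1M tokens and slower than the baseline at 4K. The real constants have to come from timing your own workload on your own serving hardware.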

Exercise

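Pick the most recent post-Transformer architecture paper you can find (any month of 2026). Run it explicitly through the four-question framework: write down the scope, the baseline (and whether it was optimized), the maturity, and any third-party replication. Based on those four answers, decide whether the paper is only research-interesting or actionable for production. Do this exercise once and you have a framework you can use for the rest of your career.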
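One way to keep the exercise honest is to record the four answers in a fixed structure before deciding. The sketch below is an assumption, not part of the framework itself: the `PaperReview` class, its field names, and the strict decision rule in `verdict` (every answer must be favorable) are hypothetical choices you may want to loosen or tighten.

```python
from dataclasses import dataclass


@dataclass
class PaperReview:
    """Written answers to the four questions for one paper."""
    title: str
    scope: str                    # where exactly the paper claims to win
    baseline_optimized: bool      # was the Transformer baseline production-grade?
    stack_mature: bool            # can it run on your stack without custom work?
    third_party_replicated: bool  # has anyone else reproduced the headline result?

    def verdict(self) -> str:
        # Assumed rule: an unstated scope or any unfavorable answer keeps the
        # paper in the research-interesting bucket.
        if not self.scope.strip():
            return "research-interesting only"
        if self.baseline_optimized and self.stack_mature and self.third_party_replicated:
            return "actionable for production"
        return "research-interesting only"


# Hypothetical example entry; the paper and its answers are made up.
review = PaperReview(
    title="Example linear-time architecture",
    scope="long-context summarization at 7B scale",
    baseline_optimized=True,
    stack_mature=False,
    third_party_replicated=False,
)
print(review.title, "->", review.verdict())
```

Writing the scope as free text and the other three answers as yes/no forces a commitment on each question before you read the conclusion section again.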
