Three lenses: Dense, MoE, Reasoning · Beyond the Transformer Quest

Three primary lenses, not a taxonomy

The rest of this quest looks at the modern LLM landscape through three primary lenses. They are *not* mutually exclusive categories: DeepSeek-R1 is MoE *and* reasoning-oriented. They are independent dimensions, like RGB color channels.

Lens 1 — Dense

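Every token activates every parameter. The well-understood baseline. Llama, Gemma, Qwen dense, Mistral, Phi-4. Easy to fine-tune, easy to serve, easy to debug, and well supported by every inference engine on the planet. Hits a ceiling somewhere around 100–400B, where the per-token FLOP cost gets brutal.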
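To put a number on "per-token FLOP cost", here is a rough sketch using the common rule of thumb that a forward pass costs about 2 FLOPs per parameter per token. It ignores the attention cost over the context window, and the function name and example sizes are illustrative assumptions, not measurements.

```python
def dense_flops_per_token(n_params: float) -> float:
    """Approximate forward-pass FLOPs per token for a dense model.

    Rule of thumb: one multiply and one add per parameter, so about 2 * N.
    Attention over the context window is ignored here.
    """
    return 2.0 * n_params


# Illustrative model sizes only.
for size in (8e9, 70e9, 400e9):
    tflops = dense_flops_per_token(size) / 1e12
    print(f"{size / 1e9:.0f}B params -> ~{tflops:.2f} TFLOPs per token")
```

Because every parameter is touched for every token, the cost grows linearly with model size, which is why the ceiling shows up.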

Lens 2 — Mixture of Experts

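Each layer holds several small expert FFNs, and only a few fire per token. Mixtral, DeepSeek-V2/V3/R1, Llama 4, Qwen3 MoE, Gemma 4 MoE, Mistral Large 3. You pay in memory and serving complexity and get a higher quality ceiling in return. This is the route by which frontier scale becomes affordable.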
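"Only a few fire per token" can be sketched as top-k routing. This is a toy version of one common variant (softmax over the selected experts' scores), not any specific model's router. The experts are scalar functions standing in for FFNs, and the router scores are made-up values.

```python
import math


def top_k_route(router_logits: list[float], k: int) -> list[tuple[int, float]]:
    """Pick the k highest-scoring experts and softmax-normalize their weights."""
    top = sorted(range(len(router_logits)), key=lambda i: router_logits[i], reverse=True)[:k]
    m = max(router_logits[i] for i in top)  # subtract the max for numerical stability
    exps = [math.exp(router_logits[i] - m) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


def moe_layer(x: float, experts, router_logits: list[float], k: int = 2) -> float:
    """Weighted sum over the outputs of only the k selected experts."""
    return sum(w * experts[i](x) for i, w in top_k_route(router_logits, k))


# Toy experts: each "FFN" is just a scalar multiply here.
experts = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]
# Hypothetical router scores for one token; a real router derives them from the hidden state.
logits = [0.1, 2.0, -1.0, 2.0]

routes = top_k_route(logits, k=2)
print(routes)
print(moe_layer(10.0, experts, logits))
```

All four experts must sit in memory, but only two run for this token. That is the memory-versus-compute trade described above.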

Lens 3 — Reasoning-oriented

Spends extra inference compute thinking before it answers. o-series, Claude extended thinking, Gemini thinking modes, DeepSeek-R1, Qwen3 thinking mode. Almost always sits on top of a dense or MoE backbone: reasoning lives on the post-training and inference axes, not in the wiring.
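A rough way to see the "extra inference compute": the weights stay the same, but the model generates more tokens. The sketch below reuses the approximation of 2 FLOPs per active parameter per generated token. The token counts are invented for illustration, and `decode_flops` is a hypothetical helper.

```python
def decode_flops(active_params: float, answer_tokens: int, thinking_tokens: int = 0) -> float:
    """Rough decode-time FLOPs: about 2 * active params per generated token."""
    return 2.0 * active_params * (answer_tokens + thinking_tokens)


# Same 32B-parameter checkpoint, with and without a thinking budget (made-up token counts).
base = decode_flops(32e9, answer_tokens=300)
with_thinking = decode_flops(32e9, answer_tokens=300, thinking_tokens=2700)

ratio = with_thinking / base
print(f"thinking costs ~{ratio:.0f}x the decode compute of the direct answer")
```

Nothing about the backbone changed between the two calls. Only the inference-time token budget did.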

How they combine

Combination	Example	Behavior
Dense + standard	Llama 3.3 70B	Predictable, fast, cheap, the conversational baseline
Dense + reasoning	Qwen3 32B (thinking on)	Same checkpoint, higher latency, much better on hard tasks
MoE + standard	DeepSeek-V3	Frontier-scale knowledge at the per-token inference cost of a ~37B model (37B of 671B parameters active)
MoE + reasoning	DeepSeek-R1	The current frontier paradigm. Capacity + careful thinking.
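Because the axes are independent, the table can be generated as a Cartesian product. This is a minimal sketch: `ModelProfile` and its field names are hypothetical, and the examples are the ones from the table.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class ModelProfile:
    backbone: str    # "dense" or "moe": an architecture property
    reasoning: bool  # thinking on/off: a post-training and inference property


# Independent axes, so the combinations are just their product.
combinations = [ModelProfile(b, r) for b, r in product(("dense", "moe"), (False, True))]

examples = {
    ModelProfile("dense", False): "Llama 3.3 70B",
    ModelProfile("dense", True): "Qwen3 32B (thinking on)",
    ModelProfile("moe", False): "DeepSeek-V3",
    ModelProfile("moe", True): "DeepSeek-R1",
}

for combo in combinations:
    print(combo, "->", examples[combo])
```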

Code

All four combinations live on the same code path (Python)

def call_model(model, prompt, *, thinking=False):
    # Backbone (dense vs MoE) decides cost shape and engine config.
    # The thinking flag toggles inference strategy without touching weights.
    return model.generate(prompt, extended_thinking=thinking)
