MoE Families: Mixtral, DeepSeek, Llama 4, Qwen3

Mixtral: the model that opened the door

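Mixtral 8×7B (December 2023, Mistral AI) was the first capable open-weight MoE that the broader community could actually run. It has 46.7B total and 12.9B active parameters, with 8 experts, top-2 routing, and no shared experts. It matches or beats Llama 2 70B on most benchmarks while using roughly 5x fewer active parameters per token (12.9B vs. 70B). Mixtral 8×22B followed at 141B total / 39B active and competes with much larger dense models. After 2024, Mistral continued the line with Mistral Small 4 and Mistral Large 3.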

DeepSeek: the fine-grained-expert design

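DeepSeek pioneered fine-grained expert segmentation: many small experts instead of a few large ones. Mixtral has 8 experts with top-2 routing, which gives C(8, 2) = 28 possible expert pairs. DeepSeek-V2 has 160 routed experts with top-6 routing, which gives C(160, 6) ≈ 2.1 × 10¹⁰ possible combinations. Routing becomes far more expressive, and each expert can specialize on a much narrower pattern.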
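The combination counts can be checked directly. The snippet below is a small sketch that counts unordered expert sets with the standard-library `math.comb`; the helper name `routing_combinations` is made up for illustration, and the count ignores any routing constraints a real model may impose (such as limits on how many devices a token can reach).

```python
import math


def routing_combinations(num_experts: int, top_k: int) -> int:
    """Number of distinct unordered sets of top_k experts out of num_experts."""
    return math.comb(num_experts, top_k)


mixtral = routing_combinations(8, 2)        # 8 experts, top-2
deepseek_v2 = routing_combinations(160, 6)  # 160 routed experts, top-6
deepseek_v3 = routing_combinations(256, 8)  # 256 routed experts, top-8

print(f"Mixtral 8x7B: {mixtral}")
print(f"DeepSeek-V2:  {deepseek_v2:.2e}")
print(f"DeepSeek-V3:  {deepseek_v3:.2e}")
```

The jump from tens of combinations to tens of billions (and beyond for the 256-expert, top-8 configuration) is what "routing expressiveness" refers to here.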

Model	Total	Active	Experts	Top-K	Innovation
DeepSeek-V2	236B	21B	160 + 2 shared	top-6	Fine-grained, MLA attention
DeepSeek-V3	671B	37B	256 + 1 shared	top-8	Aux-loss-free balancing, FP8, sigmoid routing
DeepSeek-R1	671B	37B	256 + 1 shared	top-8	Same architecture as V3 + GRPO reasoning RL

Llama 4: Meta's MoE pivot

Llama 4 (April 2025) is Meta's first MoE family. It uses top-1 routing instead of top-2: every token is sent to exactly one routed expert.

Model	Total	Active	Experts	Top-K	Context
Llama 4 Scout	109B	17B	16	top-1	10M
Llama 4 Maverick	400B	17B	128	top-1	1M
Llama 4 Behemoth	~2T	288B	—	—	preview (teacher)

Qwen3 MoE: many experts, no shared experts

Qwen3 (2025) ships two MoE variants: Qwen3 30B-A3B and Qwen3 235B-A22B, both with 128 experts, top-8 routing, and no shared experts. Both support dual thinking/non-thinking modes from a single checkpoint. Qwen3 235B-A22B is among the most capable open-weight MoEs for self-hosting in 2025–2026.

Trends at a glance

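Across families and years: experts are getting more numerous and smaller, top-K has climbed from 1–2 to 6–8 in the fine-grained designs, sigmoid routing has started to replace softmax, and shared experts come and go. The MoE design space is still actively evolving, so 2026 best practice may not be 2027 best practice.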
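To make the top-K and softmax-versus-sigmoid distinction concrete, here is a minimal single-token routing sketch in plain Python. It is an illustration under stated assumptions, not any model's actual router: the function `route`, the example logits, and the choice to renormalize the selected gate weights so they sum to 1 are all assumptions made for this example, and real implementations differ in how (or whether) they normalize.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def route(logits, top_k, scoring="softmax"):
    """Return {expert_index: gate_weight} for one token.

    softmax: experts compete, scores sum to 1 before selection.
    sigmoid: each expert is scored independently in (0, 1).
    Assumption: weights of the selected experts are renormalized to sum to 1.
    """
    if scoring == "softmax":
        scores = softmax(logits)
    elif scoring == "sigmoid":
        scores = [sigmoid(x) for x in logits]
    else:
        raise ValueError(f"unknown scoring: {scoring}")
    # Pick the top_k highest-scoring experts.
    chosen = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:top_k]
    norm = sum(scores[i] for i in chosen)
    return {i: scores[i] / norm for i in chosen}


logits = [2.0, 0.5, -1.0, 1.5]  # made-up router logits for 4 experts

top1 = route(logits, top_k=1)
top2_softmax = route(logits, top_k=2, scoring="softmax")
top2_sigmoid = route(logits, top_k=2, scoring="sigmoid")
print(top1)
print(top2_softmax)
print(top2_sigmoid)
```

With these logits both scoring functions select the same two experts, but sigmoid scoring spreads the weight more evenly between them, because each score saturates independently instead of competing through an exponential.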

Code

MoE family comparison (Python data)

# Columns: name, total params (B), active params (B), routed experts,
#          top-k, shared experts, router scoring function
moe_families = [
    ("Mixtral 8x7B",     46.7, 12.9, 8,   2, 0, "softmax"),
    ("Mixtral 8x22B",    141,  39,   8,   2, 0, "softmax"),
    ("DeepSeek-V2",      236,  21,   160, 6, 2, "softmax"),
    ("DeepSeek-V3",      671,  37,   256, 8, 1, "sigmoid"),
    # Llama 4 MoE layers pair the single routed expert with one shared expert.
    ("Llama 4 Scout",    109,  17,   16,  1, 1, "softmax"),
    ("Llama 4 Maverick", 400,  17,   128, 1, 1, "softmax"),
    ("Qwen3 30B-A3B",    30,   3,    128, 8, 0, "softmax"),
    ("Qwen3 235B-A22B",  235,  22,   128, 8, 0, "softmax"),
]
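One way to use this kind of data is to compare how sparse each model is, meaning what fraction of its parameters are active per token. The sketch below is self-contained and copies a few total/active figures from the tables in this section; the names `models`, `active_fraction`, and `ranked` are made up for illustration.

```python
# (name, total params in billions, active params in billions),
# figures copied from the tables in this section.
models = [
    ("Mixtral 8x7B", 46.7, 12.9),
    ("DeepSeek-V3", 671, 37),
    ("Llama 4 Maverick", 400, 17),
    ("Qwen3 235B-A22B", 235, 22),
]


def active_fraction(total_b: float, active_b: float) -> float:
    """Fraction of parameters used per token."""
    return active_b / total_b


# Sort from sparsest (smallest active fraction) to densest.
ranked = sorted(models, key=lambda m: active_fraction(m[1], m[2]))

for name, total_b, active_b in ranked:
    print(f"{name:<18} {active_fraction(total_b, active_b):6.1%}")
```

Read this way, the newer fine-grained and top-1 designs activate a much smaller share of their weights per token than Mixtral 8x7B does.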
