MoE intuition: not every token needs every expert

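Observation

Not every token in a sequence needs the same processing. The token "the" does not need the same computational pathway as a complex mathematical expression or a code identifier from a rare language. A dense model pays full price for every token, regardless of difficulty. Mixture of Experts challenges that: what if only the parts of the network *relevant* to a token were activated, and the rest stayed quiet?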

The structural change in one sentence

MoE replaces the single dense FFN block in each Transformer layer with a small bank of expert FFNs plus a router. Everything else (attention, layer norm, embeddings, position encoding) is identical to a dense Transformer. The architecture changes only inside the FFN.
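To make "only the FFN changes" concrete, here is a minimal sketch of a residual block where the FFN is a pluggable argument. The pre-norm layout and every name in it (`transformer_block`, `toy_dense_ffn`, `toy_moe_ffn`) are assumptions made for illustration, and the stand-ins operate on plain floats rather than tensors.

```python
def transformer_block(x, attention, norm1, norm2, ffn):
    """Pre-norm residual block (an assumed layout); `ffn` is the only swappable part."""
    x = x + attention(norm1(x))   # attention sub-layer: same for dense and MoE
    x = x + ffn(norm2(x))         # FFN sub-layer: dense FFN, or expert bank + router
    return x

# Toy stand-ins on plain floats, purely for illustration.
identity      = lambda x: x
attention     = lambda x: 0.1 * x
toy_dense_ffn = lambda x: 2.0 * x
toy_moe_ffn   = lambda x: 3.0 * x   # imagine a router + experts behind this call

dense_out = transformer_block(1.0, attention, identity, identity, toy_dense_ffn)
moe_out   = transformer_block(1.0, attention, identity, identity, toy_moe_ffn)
```

The two calls differ in a single argument, which is the whole point: the block itself has no idea whether it is dense or sparse.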

Four ingredients

Experts. Independent feed-forward networks. Every expert has the same architecture but its own learned weights. Think of them as specialized sub-networks.
Router (gate). A small learned network, usually a linear layer followed by softmax or sigmoid, that produces a score for each expert, per token.
Top-K selection. Only the K highest-scoring experts process the token. Common choices: top-2 (Mixtral), top-6 (DeepSeek-V2), top-8 (DeepSeek-V3, Qwen3 MoE).
Optional shared experts. Some architectures (DeepSeek) include 1–2 experts that are always active for every token. They provide a stable baseline pathway.
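The four ingredients can be put together in a small pure-Python sketch. It is a toy under stated assumptions, not any specific model's implementation: the token is a single float, the experts are simple scaling functions, and adding the shared-expert output directly to the weighted routed output is one plausible way to combine them. The name `moe_with_shared` is hypothetical.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def moe_with_shared(x, scores, routed_experts, shared_experts, k=2):
    """Combine always-on shared experts with the top-k routed experts.

    x              -- token representation (a plain float here, for brevity)
    scores         -- one router score per routed expert
    routed_experts -- list of callables; only k of them run
    shared_experts -- list of callables; all of them run for every token
    """
    # Indices of the k highest-scoring routed experts (ties keep original order).
    top_k = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    # Renormalize the selected scores so the mixing weights sum to 1.
    weights = softmax([scores[i] for i in top_k])

    output = sum(expert(x) for expert in shared_experts)   # baseline pathway
    for i, w in zip(top_k, weights):
        output += w * routed_experts[i](x)                 # sparse pathway
    return output, top_k

# Toy experts: each one scales its input by a different constant.
routed = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]
shared = [lambda x: 0.5 * x]

out, chosen = moe_with_shared(2.0, [0.1, 2.0, -1.0, 2.0], routed, shared, k=2)
```

With these scores, experts 1 and 3 tie for the top spot and each gets weight 0.5; experts 0 and 2 never run, while the shared expert contributes regardless of the router's decision.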

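The promise

You get the capacity of a much larger model (because the total parameter pool is large) at roughly the per-token compute of a much smaller one (because only K of N experts fire). The catch, and it is a real one, is that all N experts must live in memory even though only K fire per token. In a dense model, memory footprint and per-token compute scale together; MoE decouples them for good.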
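A quick back-of-the-envelope calculation shows the gap between what sits in memory and what a token actually touches. The sketch below assumes a gated FFN with three weight matrices and no biases, plus a single linear router; the sizes and the function names (`ffn_params`, `moe_layer_params`) are hypothetical and chosen only for illustration.

```python
def ffn_params(d_model, d_ff):
    """Weights in one gated FFN: gate, up, and down projections (no biases)."""
    return 3 * d_model * d_ff

def moe_layer_params(d_model, d_ff, num_experts, k):
    """Return (total, active-per-token) FFN parameters for one MoE layer."""
    expert = ffn_params(d_model, d_ff)
    router = d_model * num_experts           # one linear layer producing N scores
    total  = num_experts * expert + router   # everything that must sit in memory
    active = k * expert + router             # what a single token multiplies through
    return total, active

# Hypothetical sizes, for illustration only.
total, active = moe_layer_params(d_model=4096, d_ff=14336, num_experts=8, k=2)
print(f"total : {total / 1e9:.2f} B parameters held in memory")
print(f"active: {active / 1e9:.2f} B parameters used per token")
print(f"ratio : {total / active:.2f}x")
```

For top-2 of 8 the ratio lands just under 4x: the layer stores about four times more FFN weights than any one token uses. Attention and embedding parameters are not counted here, so whole-model ratios will be smaller.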

What this lesson has not covered yet

How routing actually makes its decisions; the auxiliary load-balancing loss; expert collapse; the memory paradox; how Mixtral, DeepSeek, and Llama 4 differ. All of that is coming. For now the only goal is to internalize the shape: dense FFN → expert bank + router.

Code

Dense FFN vs MoE FFN: the only architectural diff·python

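import torch
import torch.nn.functional as F

# Dense FFN: one network, every token uses all of its weights.
def dense_ffn(x, W_gate, W_up, W_down):
    # Gated FFN. The parentheses matter: @ and * share precedence and
    # associate left to right, so the gating product must be grouped first.
    return W_down @ (F.silu(W_gate @ x) * (W_up @ x))   # all weights active

# MoE FFN: one router + N experts, top-K of them fire.
def moe_ffn(x, router, experts, k=2):
    scores = router(x)                                  # (N,) one score per expert
    top_scores, top_idx = torch.topk(scores, k)         # K best experts for this token
    weights = torch.softmax(top_scores, dim=-1)         # renormalize over the chosen K
    output = torch.zeros_like(x)
    for i, w in zip(top_idx.tolist(), weights):
        output = output + w * experts[i](x)             # only k experts compute
    return output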

Routing decision with top-K masking in BF16·python

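import torch

def route(x, gate_proj, k=2, num_experts=8):
    # num_experts only documents the expected gate width; gate_proj's output size sets it.
    # The computation runs in whatever dtype x and gate_proj use (e.g. torch.bfloat16).
    logits  = gate_proj(x)                  # (B*T, num_experts)
    weights, idx = logits.topk(k, dim=-1)   # top-K logits per token
    weights = weights.softmax(dim=-1)       # renormalize over the K selected experts only
    return idx, weights                     # idx: (B*T, k), weights: (B*T, k)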
