Router, Top-K, Shared experts

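The router is the brain of the MoE

The router is a small network, typically a single linear layer that maps a token's hidden state to a vector of logits over the experts, yet it is responsible for every routing decision. It is trained jointly with the rest of the model. If the router is bad, the experts receive useless training signal and the whole layer collapses.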

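How Top-K works

For each token, the router emits N scores, one per expert. Top-K selection keeps the K highest scores and discards the rest, so the unselected experts contribute nothing for that token. The kept scores pass through a softmax (or, in some 2025+ designs, a sigmoid) to become routing weights. Each selected expert's output is multiplied by its weight, and the K weighted outputs are summed.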
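The select, normalize, and combine steps can be traced on a single token. This is a minimal sketch, not any particular model's implementation: the toy experts are plain linear layers, the router scores are hand-picked, and `moe_combine` is a hypothetical helper name.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
d_model, num_experts, k = 4, 4, 2

# Hypothetical toy experts: each one is just a linear map.
experts = [nn.Linear(d_model, d_model) for _ in range(num_experts)]

def moe_combine(x, logits, k):
    """Combine the top-k experts' outputs for one token x of shape (d_model,)."""
    topk_vals, topk_idx = logits.topk(k)       # keep the k highest scores
    weights = F.softmax(topk_vals, dim=-1)     # normalize only the kept scores
    out = torch.zeros_like(x)
    for w, i in zip(weights, topk_idx.tolist()):
        out = out + w * experts[i](x)          # weighted sum of expert outputs
    return out, topk_idx, weights

x = torch.randn(d_model)
logits = torch.tensor([2.0, -1.0, 0.5, 1.0])   # hand-picked router scores
out, idx, weights = moe_combine(x, logits, k)
print(idx.tolist())                            # [0, 3]
```

Experts 1 and 2 are never evaluated for this token, which is where the compute saving comes from.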

Common K values in practice

Top-1: The original choice in Switch Transformer. Cheapest, and the most prone to expert collapse. Llama 4 Scout routes top-1.
Top-2: Mixtral's choice. Two experts hedge against routing mistakes for a modest amount of extra compute. The dominant choice in 2023–2024.
Top-6 to Top-8: DeepSeek's fine-grained-expert design. Many small experts, more of them selected per token, and more nuanced specialization.
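One way to see why many small experts with a larger K allow more nuanced specialization is to count how many distinct expert subsets a token can be routed to. The sketch below compares a Mixtral-style layout (8 experts, top-2) with a hypothetical fine-grained layout (64 experts, top-6); the second configuration is illustrative only, and `expert_combinations` is a made-up helper name.

```python
from math import comb

def expert_combinations(num_experts, k):
    """Number of distinct k-expert subsets a single token could be routed to."""
    return comb(num_experts, k)

few_large = expert_combinations(8, 2)    # Mixtral-style: 8 experts, top-2
many_small = expert_combinations(64, 6)  # hypothetical fine-grained layout
print(few_large)                         # 28
print(many_small > few_large)            # True
```

The count says nothing about quality by itself, but it shows how quickly the space of possible expert mixtures grows once experts are split finer and more of them are selected per token.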

Shared experts

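DeepSeek-V2/V3 and Gemma 4 MoE include 1–2 shared experts that are always active for every token, regardless of the router's decision. They provide an always-on, stable baseline, so the routed experts do not have to relearn common patterns. It is a small architectural addition with a large effect on training stability.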
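A shared expert can be sketched as a branch that bypasses the router and is added to the routed output. The layer below is a toy illustration under simplifying assumptions (linear layers stand in for expert FFNs, softmax top-K routing, a simple per-expert loop); the class name `SharedExpertMoE` and its arguments are hypothetical and do not mirror any specific model's code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedExpertMoE(nn.Module):
    """Toy MoE layer: per-token routed experts plus always-on shared experts."""
    def __init__(self, d_model, num_routed, num_shared=1, k=2):
        super().__init__()
        self.gate = nn.Linear(d_model, num_routed, bias=False)
        self.routed = nn.ModuleList([nn.Linear(d_model, d_model) for _ in range(num_routed)])
        self.shared = nn.ModuleList([nn.Linear(d_model, d_model) for _ in range(num_shared)])
        self.k = k

    def forward(self, x):                        # x: (num_tokens, d_model)
        topk_vals, topk_idx = self.gate(x).topk(self.k, dim=-1)
        weights = F.softmax(topk_vals, dim=-1)   # (num_tokens, k)
        # Shared experts bypass the router and see every token.
        shared_out = sum(expert(x) for expert in self.shared)
        # Routed experts only process the tokens assigned to them.
        routed_out = torch.zeros_like(x)
        for e, expert in enumerate(self.routed):
            token_pos, slot = (topk_idx == e).nonzero(as_tuple=True)
            if token_pos.numel() > 0:
                w = weights[token_pos, slot].unsqueeze(-1)
                routed_out[token_pos] += w * expert(x[token_pos])
        return shared_out + routed_out

torch.manual_seed(0)
layer = SharedExpertMoE(d_model=8, num_routed=4, num_shared=1, k=2)
x = torch.randn(5, 8)
y = layer(x)                                     # same shape as x
```

Because the shared branch does not depend on `topk_idx`, it receives gradient from every token, which is one intuition for why it acts as a stable baseline.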

Sigmoid vs softmax routing

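Most early MoEs apply a softmax over the expert logits. DeepSeek-V3 introduced sigmoid routing together with auxiliary-loss-free balancing: each expert's score is independent (no zero-sum competition), and load balancing comes from a per-expert bias term rather than from an explicit auxiliary loss. The bias is added to the scores only when selecting the top-K experts, and it is adjusted after each training step according to each expert's load rather than learned by gradient descent; the routing weights themselves are computed from the unbiased scores. With this, DeepSeek-V3 avoided expert collapse while keeping only a very small sequence-wise balance loss as a safeguard, instead of relying on a large auxiliary loss.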
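The DeepSeek-V3 report describes the balancing bias as being adjusted by a simple load-based rule rather than by backpropagation: lower the bias of overloaded experts, raise it for underloaded ones. The sketch below is one plausible reading of that rule, not the authors' code; the function name `update_expert_bias`, the `step_size` argument, and comparing each expert's load to the batch mean are assumptions.

```python
import torch

def update_expert_bias(expert_bias, topk_idx, num_experts, step_size=1e-3):
    """Nudge each expert's selection bias toward a balanced load (no gradients)."""
    # Count how many token slots each expert received in this batch.
    load = torch.bincount(topk_idx.flatten(), minlength=num_experts).float()
    # Experts above the mean load get a lower bias, experts below it a higher one.
    return expert_bias + step_size * torch.sign(load.mean() - load)

expert_bias = torch.zeros(4)
topk_idx = torch.tensor([[0, 1], [0, 2], [0, 1]])  # 3 tokens, k=2; expert 0 is overloaded
expert_bias = update_expert_bias(expert_bias, topk_idx, num_experts=4)
```

After this step expert 0 is slightly less likely to be selected and the unused expert 3 slightly more likely, without any extra loss term entering the gradient.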

Code

Router with Top-K and softmax weighting·python

import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKRouter(nn.Module):
    def __init__(self, d_model, num_experts, k=2):
        super().__init__()
        self.gate = nn.Linear(d_model, num_experts, bias=False)
        self.k = k

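    def forward(self, x):
        # One logit per expert for every token.
        logits = self.gate(x)                  # (..., num_experts)
        # Keep the k highest-scoring experts and their indices.
        topk_vals, topk_idx = logits.topk(self.k, dim=-1)
        # Softmax over the kept logits only, so each token's weights sum to 1.
        weights = F.softmax(topk_vals, dim=-1)
        return topk_idx, weights               # each of shape (..., k)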

DeepSeek-V3 style sigmoid routing with bias·python

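class SigmoidRouter(nn.Module):
    def __init__(self, d_model, num_experts, k=8):
        super().__init__()
        self.gate = nn.Linear(d_model, num_experts, bias=False)
        # Per-expert bias for load balancing (no aux loss needed). It is a buffer,
        # not a gradient-trained parameter: the training loop nudges it up for
        # underloaded experts and down for overloaded ones.
        self.register_buffer("expert_bias", torch.zeros(num_experts))
        self.k = k

    def forward(self, x):
        scores = torch.sigmoid(self.gate(x))   # (..., num_experts)
        # The bias influences only which experts are selected.
        _, topk_idx = (scores + self.expert_bias).topk(self.k, dim=-1)
        # Routing weights come from the unbiased scores, renormalized over the
        # selected experts (sigmoid scores do not sum to 1 on their own).
        topk_vals = scores.gather(-1, topk_idx)
        weights = topk_vals / topk_vals.sum(dim=-1, keepdim=True)
        return topk_idx, weights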

Exercise

Pick three MoE models (Mixtral 8x7B, DeepSeek-V3, Llama 4 Maverick) and write down their (N experts, top-K, shared experts) tuples. Note how the design space moved from "few large experts, top-2" (Mixtral) to "many small experts, top-8, with shared experts" (DeepSeek). That shift is the most important design-evolution story in MoE from 2024 to 2026.