What 'Dense' Actually Means

Definition: every token, every parameter

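A dense model is a decoder-only Transformer in which every token activates every feed-forward weight in every layer. When a token flows through the FFN block, it passes through every neuron: no routing, no expert gating, no selective activation. For a 70B model, all 70B parameters take part for every single token, at every position, in every batch.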

Three properties that fall out of the definition

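Total = active. What you see is what you compute. The parameter count on the model card is the parameter count that fires.
Uniform compute per token. "the" costs the same as a complex code token. Capacity planning is simple: tokens × per-token cost.
Deterministic activation path. Same input, same weights, same path every time. No router decisions to debug.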
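
The "tokens × per-token cost" rule can be sketched in a few lines. This is a rough estimate, not a profiler: the function names are made up for illustration, and the per-token cost uses the common rule of thumb of about 2 FLOPs per parameter for a forward pass, which is an assumption here and ignores attention-score compute.

```python
def approx_forward_flops_per_token(num_params):
    # Assumption (rule of thumb): one multiply-add per parameter per token,
    # counted as 2 FLOPs. Attention-score compute is ignored.
    return 2 * num_params


def request_flops(num_tokens, flops_per_token):
    # Dense capacity planning: every token costs the same,
    # so total compute is just tokens x per-token cost.
    return num_tokens * flops_per_token


if __name__ == "__main__":
    per_token = approx_forward_flops_per_token(70_000_000_000)  # nominal 70B model
    total = request_flops(1_000, per_token)                     # a 1,000-token request
    print(f"per token: {per_token:.3e} FLOPs, request: {total:.3e} FLOPs")
```

Because the cost per token does not depend on the token's content, the same two numbers cover any request of the same length.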

What dense gives up, and what it keeps

Dense gives up the ability to scale total parameters faster than per-token compute. Double a dense model's parameters and per-token FLOPs roughly double. At the frontier (100B+) that gets expensive fast, which is why MoE exists. But dense keeps almost everything else: simplicity, debuggability, ease of fine-tuning, and a mature serving ecosystem (vLLM, llama.cpp, MLX, and TensorRT-LLM all started dense-first).
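
To put that trade-off in numbers, here is a minimal sketch comparing one dense FFN layer with a simplified MoE layer. The helper names, the `num_experts` and `top_k` parameters, and the example sizes are illustrative assumptions, and the router's own cost is ignored.

```python
def ffn_params(d_model, d_ffn):
    # SwiGLU-style FFN: gate, up, and down projections.
    return 3 * d_model * d_ffn


def dense_layer(d_model, d_ffn):
    total = ffn_params(d_model, d_ffn)
    # Dense: every parameter is used for every token.
    return {"total_params": total, "flops_per_token": 2 * total}


def moe_layer(d_model, d_ffn, num_experts, top_k):
    # Simplified MoE (assumption): num_experts identical FFNs,
    # each token runs through only top_k of them. Router cost ignored.
    per_expert = ffn_params(d_model, d_ffn)
    return {
        "total_params": num_experts * per_expert,
        "flops_per_token": 2 * top_k * per_expert,
    }


if __name__ == "__main__":
    small, big = dense_layer(4096, 8192), dense_layer(4096, 16384)
    # Dense: doubling parameters doubles per-token FLOPs.
    print(big["total_params"] / small["total_params"],
          big["flops_per_token"] / small["flops_per_token"])

    moe_8, moe_16 = moe_layer(4096, 8192, 8, 2), moe_layer(4096, 8192, 16, 2)
    # MoE: doubling experts doubles parameters, per-token FLOPs stay flat.
    print(moe_16["total_params"] / moe_8["total_params"],
          moe_16["flops_per_token"] / moe_8["flops_per_token"])
```

In the dense case the two ratios move together; in the MoE case only the parameter ratio changes, which is the capability dense gives up.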

Where dense still wins in 2026

Below ~30B, MoE overhead (router, load balancing, expert parallelism) eats the efficiency gain. That's why small and medium open-weight models are still overwhelmingly dense: Llama 3 8B, Gemma 3 27B, Qwen3 14B/32B, Phi-4 14B, Mistral NeMo 12B. For local deployment workloads where you want predictable memory and quantization-friendly behavior, dense is the default.
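
"Predictable memory" is easy to see with a back-of-the-envelope estimate: weight memory is parameter count times bits per weight. The sketch below is weights only, assuming a uniform bit width; the function name is hypothetical, and KV cache, activations, and quantization metadata (scales, zero points) are deliberately left out.

```python
def weight_memory_gib(num_params, bits_per_weight):
    # Weights only (assumption): no KV cache, activations,
    # or per-group quantization scales.
    total_bytes = num_params * bits_per_weight / 8
    return total_bytes / 2**30


if __name__ == "__main__":
    nominal_params = 8_000_000_000  # a nominal 8B-parameter dense model
    for bits in (16, 8, 4):
        gib = weight_memory_gib(nominal_params, bits)
        print(f"{bits:>2}-bit weights: {gib:.1f} GiB")
```

For a dense model this figure is the whole story for weights: every parameter must be resident and every parameter is used, so there is no gap between what you load and what you run.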

Code

Dense FFN block in PyTorch (canonical shape)·python

import torch.nn as nn
import torch.nn.functional as F

class DenseFFN(nn.Module):
    """SwiGLU-style feed-forward block with no routing: one path for every token."""

    def __init__(self, d_model, d_ffn):
        super().__init__()
        # Three bias-free projections: gate and up expand to d_ffn, down projects back.
        self.gate = nn.Linear(d_model, d_ffn, bias=False)
        self.up   = nn.Linear(d_model, d_ffn, bias=False)
        self.down = nn.Linear(d_ffn, d_model, bias=False)

    def forward(self, x):
        # Every token sees every weight, every time.
        return self.down(F.silu(self.gate(x)) * self.up(x))

Per-token FLOP estimate for a dense FFN layer·python

def dense_ffn_flops_per_token(d_model, d_ffn):
    # gate, up: each d_model x d_ffn
    # down:          d_ffn x d_model
    # The forward pass is 3 matmuls of those shapes; each multiply-add counts as 2 FLOPs.
    return 2 * (3 * d_model * d_ffn)

# Example: Llama 3 70B has d_model=8192, d_ffn=28672
# -> ~1.4 GFLOP per token per layer, just in the FFN block.

Exercise

Pick any small dense model on Hugging Face (Llama 3 8B, Qwen3 4B, Phi-4 14B) and open its config.json. Find d_model (hidden_size) and d_ffn (intermediate_size). Compute the per-token FFN FLOPs with the formula above, then multiply by num_hidden_layers to get a rough per-token dense FFN compute budget for that model.
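
If you'd rather script it than do it by hand, one possible starting point is below. It assumes the config uses the hidden_size, intermediate_size, and num_hidden_layers keys and a three-matmul SwiGLU-style FFN; the helper name is made up, and the inline JSON is a stand-in with illustrative values, so paste in the real config.json of the model you picked.

```python
import json


def ffn_budget_from_config(config):
    # Assumes a three-matmul (gate, up, down) FFN, 2 FLOPs per multiply-add.
    d_model = config["hidden_size"]
    d_ffn = config["intermediate_size"]
    n_layers = config["num_hidden_layers"]
    per_layer = 2 * (3 * d_model * d_ffn)
    return per_layer, per_layer * n_layers


if __name__ == "__main__":
    # Stand-in for a downloaded config.json (illustrative values).
    config_text = """
    {"hidden_size": 4096, "intermediate_size": 14336, "num_hidden_layers": 32}
    """
    per_layer, total = ffn_budget_from_config(json.loads(config_text))
    print(f"per layer: {per_layer / 1e9:.2f} GFLOP, all layers: {total / 1e9:.2f} GFLOP")
```

Compare the total against the model's nominal parameter count times two to see how much of the per-token compute sits in the FFN blocks.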