The memory paradox: VRAM scales with total, FLOPs with active

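The number that surprises people

DeepSeek-V3 has 671B total parameters and ~37B active per token. That ratio (~5.5%) means per-token compute is about 5.5% of what a 671B dense model would need. But the memory required is 100% of a 671B dense model. All 671B parameters have to be loaded: every expert holds different weights, and you don't know which ones will fire for the next token.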
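The ratio above is simple enough to check by hand, but a minimal sketch makes the asymmetry explicit. The function name `active_fraction` is made up for this illustration; the two parameter counts are the ones quoted in the text.

```python
def active_fraction(total_b: float, active_b: float) -> float:
    """Share of parameters that take part in one token's forward pass."""
    return active_b / total_b

TOTAL_B = 671   # total parameters, in billions (from the text)
ACTIVE_B = 37   # approximate active parameters per token, in billions

frac = active_fraction(TOTAL_B, ACTIVE_B)
print(f"compute share per token: {frac:.1%}")       # about 5.5%
print("memory share that must be resident: 100%")   # every expert stays loaded
```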

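Why memory can't follow active

Router decisions are made per token and depend on the input. There is no way to know in advance which experts will fire, so every expert has to be ready. This is fundamentally different from compute, which happens only in the selected experts. Memory is "what could be touched"; compute is "what was touched". MoE shrinks the second but not the first.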
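A toy sketch of the "could be touched" versus "was touched" distinction. Everything here is an assumption for illustration: the expert count, the top-k value, and the random scores that stand in for a learned router. It is not DeepSeek-V3's actual routing.

```python
import random

NUM_EXPERTS = 16   # hypothetical toy size, not a real model configuration
TOP_K = 2          # hypothetical number of experts selected per token

def route(token_scores, k=TOP_K):
    """Return the indices of the k highest-scoring experts for one token."""
    ranked = sorted(range(len(token_scores)),
                    key=lambda e: token_scores[e], reverse=True)
    return ranked[:k]

rng = random.Random(0)   # random scores stand in for a learned, input-dependent router
touched = set()          # experts that did any compute across the token stream
for _ in range(200):
    scores = [rng.random() for _ in range(NUM_EXPERTS)]
    touched.update(route(scores))

print(f"experts computing per token: {TOP_K}/{NUM_EXPERTS}")
print(f"experts used across the stream: {len(touched)}/{NUM_EXPERTS}")
```

Each token only pays for `TOP_K` experts, but over a few hundred tokens the set of experts that were used at least once should cover essentially all of them, which is why none can be left out of memory.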

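The shape of serving cost

For an MoE model:

VRAM cost ∝ total parameters. You need a fleet sized for the total.
Per-token FLOP cost ∝ active parameters. For the same compute, you can serve more tokens per second than with a dense model of the same total size.
Throughput depends on batch composition. If many tokens in a batch want the same expert, the GPUs hosting that expert are overloaded; if tokens spread evenly across experts, parallelism is good. This makes MoE batch behavior hard to predict.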
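The batch-composition point can be sketched with a simple load count. This is a simplified model under stated assumptions: one expert per token (top-1 routing), a hypothetical 8 experts, and hand-built batches. The helper names `expert_load` and `imbalance` are invented for the example.

```python
from collections import Counter

def expert_load(assignments, num_experts):
    """Tokens routed to each expert; assignments holds one expert id per token."""
    counts = Counter(assignments)
    return [counts.get(e, 0) for e in range(num_experts)]

def imbalance(load):
    """Busiest expert's load relative to the mean; 1.0 is perfectly balanced."""
    return max(load) / (sum(load) / len(load))

NUM_EXPERTS = 8                                            # hypothetical toy size
balanced = [t % NUM_EXPERTS for t in range(64)]            # tokens spread evenly
skewed = [0] * 50 + [t % NUM_EXPERTS for t in range(14)]   # most tokens want expert 0

for name, batch in [("balanced", balanced), ("skewed", skewed)]:
    load = expert_load(batch, NUM_EXPERTS)
    print(name, load, f"imbalance={imbalance(load):.2f}")
```

If experts live on different GPUs, the step cannot finish until the busiest expert is done, so the same 64-token batch costs very different wall-clock time in the two cases even though total FLOPs are identical.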

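Why MoE is harder to run locally

A single 80GB H100 cannot hold DeepSeek-V3 even at FP8 (about 671GB for the weights alone). Even 4-bit quantization still needs ~336GB. So frontier-scale MoE is fundamentally a multi-GPU or multi-node story. A dense model with equivalent per-token compute (a hypothetical ~37B dense) needs ~37GB at FP8 and fits in about half of one H100.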
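A rough lower bound on GPU count follows directly from weight size. The sketch below counts weights only (no KV cache, activations, or runtime overhead), so real deployments need more. The helper names are illustrative.

```python
import math

def weights_gb(total_b, bytes_per_param):
    """Weight memory in GB: billions of parameters times bytes per parameter."""
    return total_b * bytes_per_param

def min_gpus(total_b, bytes_per_param, gpu_gb=80):
    """Lower bound on GPU count from weights alone."""
    return math.ceil(weights_gb(total_b, bytes_per_param) / gpu_gb)

for name, bpp in [("BF16", 2.0), ("FP8", 1.0), ("INT4", 0.5)]:
    print(f"{name}: 671B MoE needs >= {min_gpus(671, bpp)} x 80GB, "
          f"37B dense needs >= {min_gpus(37, bpp)} x 80GB")
```

The hypothetical 37B dense model stays on one 80GB card at every precision here, while the 671B MoE needs several cards even at 4 bits.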

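Pricing implications

API providers price on total serving cost (memory plus utilization), not on FLOPs per token alone. That is why a 671B-A37B MoE model is not priced 5× below a 70B dense model just because its FLOPs per token are comparable. The economics are driven largely by total parameters, not just active ones.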

Code

MoE memory vs. compute estimate · python

def memory_gb_bf16(total_B):           # weights only; +KV cache on top
    return total_B * 2

def memory_gb_fp8(total_B):
    return total_B * 1.0

def memory_gb_int4(total_B):
    return total_B * 0.5

# DeepSeek-V3 has 671B total, ~37B active.
print("BF16:", memory_gb_bf16(671), "GB")  # ~1342 GB
print("FP8:",  memory_gb_fp8(671),  "GB")  # ~671 GB
print("INT4:", memory_gb_int4(671), "GB")  # ~336 GB

# Per-token forward FLOPs (very rough): ~2 FLOPs per active parameter,
# ignoring attention over the context.
flops_per_token = 2 * 37 * 1e9             # ~74 GFLOP/token

Exercise

Compute the memory needed to host DeepSeek-V3 at BF16, FP8, and INT4 quantization. Then look at these hardware budgets: a single 80GB H100, 4×H100, 8×H100, and 1×Mac Studio M2 Ultra (192GB unified). Which configurations can hold which quantizations? This kind of napkin math is the day-to-day work of architectural literacy.