현대 LLM 의 네 축

나침반

모든 현대 LLM 은 네 개의 독립 축에 위치해. 독립이라는 건 한 축의 변화가 다른 축들과 거의 직교한다는 뜻이야 — DeepSeek-R1 과 DeepSeek-V3 는 같은 backbone, 같은 inference engine, 거의 같은 product surface 인데 행동은 매우 달라. 차이는 순전히 post-training 축에 살아.

Axis 1 — Backbone architecture

네트워크가 실제로 어떻게 wired 되었나? Dense vs MoE vs hybrid. 파라미터가 몇 개 존재하고, 토큰당 몇 개가 활성화돼? Attention variant 는 — full attention, GQA, MQA, MLA? Position 은 어떻게 인코딩 — RoPE, ALiBi, NoPE, YaRN-extended? Hugging Face repo 의 config.json 에서 읽을 수 있는 건 다 여기 살아.

Axis 2 — Post-training

인터넷 크롤하는 autoregressive pretraining 끝난 다음에 뭘 했나? RLHF, DPO, GRPO, RLVR, distillation, reasoning-focused RL. 이게 raw next-token machine 을 assistant 로 바꾸는 데고, 2025–2026 frontier 모델 "personality" 와 capability 의 대부분이 실제로 여기서 와. DeepSeek-V3 → DeepSeek-R1 점프는 순수 post-training; weights file shape 는 같고 training recipe 만 달라.

Axis 3 — Inference strategy

런타임에 모델을 어떻게 부르나? Standard autoregressive, extended thinking with reasoning tokens, test-time compute scaling, speculative decoding, beam search, grammar-constrained decoding. 같은 weights, 다른 runtime behavior. Extended thinking 끈 Claude Sonnet 과 켠 Claude Sonnet 은 같은 checkpoint 야 — inference 축만 바꾼 거야.

Axis 4 — Product behavior

모델을 무엇으로 감쌌나? RAG, tool use, agent loop, system prompt, guardrail, memory, multi-step orchestration. ChatGPT-the-product 는 o3-the-model 위에 얇은 레이어고, "Computer Use 켠 Claude" 는 같은 Claude 에 다른 tool surface 야. Application layer, architecture 아님.

Code

Model card 를 네 축으로 읽기·python

# Sample card text:
# "DeepSeek-R1, a 671B-A37B MoE model with MLA attention,
#  trained with GRPO reinforcement learning on top of V3,
#  served with extended thinking <think> blocks,
#  available with native tool use."

axes = {
    "backbone":       "671B-A37B MoE, MLA attention",   # axis 1
    "post_training":  "GRPO RL on top of V3",            # axis 2
    "inference":      "extended thinking <think> blocks",# axis 3
    "product":        "native tool use",                 # axis 4
}
# Each phrase belongs to exactly ONE axis — that's the test.

같은 weights, 다른 축·python

# Same Claude Sonnet checkpoint, both modes:
fast_response   = call_claude(messages, thinking=False)
careful_response = call_claude(messages, thinking=True, budget=8000)
# Backbone, post-training, product — all identical.
# Only inference axis (axis 3) changed.

Exercise

실제로 쓰는 모델 하나 골라 (Claude, GPT, Gemini, Llama, Qwen, 뭐든). 4행 표 만들어 — 축마다 한 행 — 그 모델에 대해 아는 걸 적어. 어떤 축이 mental model 에서 가장 *모호한지* 가 중요해. 거기가 마케팅이 거짓말할 여지가 가장 많은 자리야.

나침반