Parameter Count and Scaling

Empirical scaling law

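For language models, Kaplan et al. (2020) and Chinchilla (Hoffmann et al., 2022) showed that loss falls as a smooth power law in parameters and tokens. Roughly: compute-optimal training splits additional compute evenly between scaling parameters and scaling tokens. A 70B-parameter model trained on 1.4 trillion tokens (Chinchilla-optimal) outperforms a 175B-parameter model trained on 300 billion tokens, with fewer than half the parameters. The 70B run is not cheaper to train (under the common C ≈ 6ND estimate it uses roughly twice the training FLOPs); what it buys is a better model that is far cheaper to serve.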
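The two training runs above can be compared with back-of-the-envelope arithmetic. This is a minimal sketch assuming the widely used approximation C ≈ 6·N·D training FLOPs (N parameters, D tokens); the function name `train_flops` is illustrative, not from any library.

```python
def train_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training compute, assuming C ~= 6 * N * D FLOPs."""
    return 6.0 * n_params * n_tokens


# (parameters, training tokens) for the two runs discussed above.
runs = {
    "70B on 1.4T tokens": (70e9, 1.4e12),
    "175B on 300B tokens": (175e9, 300e9),
}

for name, (n, d) in runs.items():
    flops = train_flops(n, d)
    print(f"{name}: {flops:.2e} FLOPs, {d / n:.1f} tokens per parameter")
```

Under this estimate the 70B run sees about 20 tokens per parameter, while the 175B run sees fewer than 2, which is the sense in which the larger model is under-trained.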

Lesson: a bigger model alone does not help without proportionally more data. Many pre-Chinchilla 'huge model, modest data' efforts were in fact under-trained.

Tip: 'How big should my model be?' is the wrong question. The right question: 'given a compute budget, which split between parameters and tokens minimizes loss?' Chinchilla-style scaling laws give a surprisingly precise answer.
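One way to turn that question into numbers is sketched below. It rests on two assumptions that are not exact constants: training compute C ≈ 6·N·D, and a rule of thumb of roughly 20 training tokens per parameter at the compute-optimal point. The name `compute_optimal_split` is hypothetical.

```python
import math

TOKENS_PER_PARAM = 20.0  # assumed rule of thumb, not an exact constant


def compute_optimal_split(flops_budget: float,
                          tokens_per_param: float = TOKENS_PER_PARAM):
    """Return (params, tokens) that spend flops_budget, assuming
    C = 6 * N * D and D = tokens_per_param * N."""
    # Substituting D = r * N into C = 6 * N * D gives N = sqrt(C / (6 * r)).
    n_params = math.sqrt(flops_budget / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens


n, d = compute_optimal_split(5.88e23)
print(f"params: {n / 1e9:.0f}B, tokens: {d / 1e12:.2f}T")
```

With a budget of 5.88e23 FLOPs this recovers roughly the 70B / 1.4T configuration. Changing the assumed ratio shifts the split, so treat the output as a starting point rather than a prescription.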

What scaling actually buys you

Lower training loss: predictable, following a power law.
Emergent capabilities: some abilities (multi-step reasoning, code generation) appear abruptly above a certain scale.
Better few-shot / zero-shot performance: larger models generalize from a few examples.
Diminishing returns at the top: in most current setups, going from 100B to 200B parameters costs far more than the accuracy it gains.
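The power-law shape and the diminishing returns can be illustrated with a parametric loss of the form L(N, D) = E + A/N^alpha + B/D^beta. The sketch below assumes that form, and the constants are approximately those reported for the Chinchilla fit; treat them as illustrative placeholders, not values to rely on for your own setup.

```python
def parametric_loss(n_params: float, n_tokens: float,
                    E: float = 1.69, A: float = 406.4, B: float = 410.7,
                    alpha: float = 0.34, beta: float = 0.28) -> float:
    """Loss model L = E + A / N**alpha + B / D**beta (illustrative constants)."""
    return E + A / n_params ** alpha + B / n_tokens ** beta


tokens = 1.4e12  # hold the data budget fixed and only grow the model
sizes = [50e9, 100e9, 200e9]
losses = [parametric_loss(n, tokens) for n in sizes]

for n, loss in zip(sizes, losses):
    print(f"{n / 1e9:.0f}B params: predicted loss {loss:.4f}")

# Each doubling of parameters buys a smaller loss reduction than the last.
gain_50_to_100 = losses[0] - losses[1]
gain_100_to_200 = losses[1] - losses[2]
print(f"gain 50B->100B: {gain_50_to_100:.4f}, gain 100B->200B: {gain_100_to_200:.4f}")
```

Because the parameter term decays as a power of N, every doubling roughly doubles the cost but shrinks the improvement, which is the diminishing-returns pattern in the list above.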

For application work

You almost never train at frontier scale. You pick a pretrained model from a menu (3B, 7B, 13B, 32B, 70B, ...) and fine-tune it. Scaling-law intuition matters because it tells you which model size to start from: usually the smallest one that meets your accuracy requirement, since inference cost scales with parameter count.
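A quick way to see how serving cost grows with parameter count is the memory needed just to hold the weights. The sketch below assumes memory ≈ parameters × bytes per parameter and ignores activations, KV cache, and optimizer state; the helper name and the dtype table are illustrative.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16/bf16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(n_params: float, dtype: str) -> float:
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e9


for size_b in (3, 7, 13, 32, 70):
    row = ", ".join(
        f"{dtype}: {weight_memory_gb(size_b * 1e9, dtype):.1f} GB"
        for dtype in BYTES_PER_PARAM
    )
    print(f"{size_b}B -> {row}")
```

For example, a 7B model needs about 14 GB for weights in fp16/bf16, while a 70B model needs about 140 GB, before any runtime overhead.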

Principle: bigger is not always better. Pick the smallest model size that hits your accuracy target. Inference cost is paid on every request, so you live with it for as long as the model serves traffic, not just once.
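The selection rule can be written down directly. This is a minimal sketch: the function name is hypothetical, and the accuracy numbers are made up for illustration. In practice they would come from evaluating each fine-tuned candidate on your own held-out set.

```python
from typing import Optional


def smallest_model_meeting_target(scores: dict, target: float) -> Optional[float]:
    """scores maps model size (billions of parameters) to measured accuracy.
    Returns the smallest size whose accuracy meets the target, or None."""
    for size in sorted(scores):
        if scores[size] >= target:
            return size
    return None


# Hypothetical evaluation results, not real benchmark numbers.
eval_scores = {3: 0.78, 7: 0.84, 13: 0.87, 32: 0.90, 70: 0.91}

choice = smallest_model_meeting_target(eval_scores, target=0.85)
print(f"smallest model meeting target: {choice}B")
```

Returning None when no candidate qualifies keeps the failure explicit: it signals that the target needs better data or a different approach, not automatically a larger model.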

Code

Parameter count as a model-size signal·python

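import torch.nn as nn

def count_params(model: nn.Module) -> tuple[int, int]:
    """Return (total, trainable) parameter counts for a PyTorch module."""
    total = sum(p.numel() for p in model.parameters())
    # Frozen parameters (requires_grad=False) count toward total only.
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    return total, trainable

# Common LLM sizes (parameters):
# 1.5B   - small chat / on-device
# 3B     - small with reasoning
# 7B     - sweet spot for most local fine-tuning
# 13B    - middle ground
# 32B-70B - serious fine-tune capacity
# 100B+  - frontier; usually only inference

# Stand-in module so the snippet runs; swap in your own model.
model = nn.Linear(4096, 4096)
total, trainable = count_params(model)
print(f"params: {total/1e9:.2f}B total, {trainable/1e9:.2f}B trainable")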
