The Chinchilla trap: optimize where the money is

The "Chinchilla trap" is a specific failure mode of the Chinchilla insight: blindly following compute-optimal training without considering inference economics.

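Chinchilla minimizes training loss per FLOP. That metric optimizes the training phase. It does not optimize the phase that actually generates revenue, which is inference. Every query uses compute proportional to the model's parameter count, and it does so hundreds of millions to billions of times over the model's deployed lifetime.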

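Concretely: compare a Chinchilla-optimal 280B model trained on 5.6T tokens (20 tokens per parameter) with an over-trained 8B model trained on 15T tokens (about 1,900 tokens per parameter). The 280B model is about 35x more expensive per inference query, because per-token compute scales with parameter count. The two runs are not equal in training cost either: under the common C ≈ 6ND approximation, the 280B run needs roughly 13x the training FLOPs of the 8B run. If you are going to serve a billion queries, the over-trained 8B model wins economically by a wide margin, even though its training looks "wasteful" by Chinchilla standards.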
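The arithmetic behind this comparison can be checked with a short sketch. It assumes the widely used rules of thumb that training a dense transformer costs about 6 * N * D FLOPs and that inference costs about 2 * N FLOPs per token (N parameters, D training tokens); the function names are illustrative only, and real costs also depend on hardware utilization, batching, and sequence length.

```python
# Assumption: training FLOPs ~ 6 * N * D and inference FLOPs per token ~ 2 * N,
# where N is the parameter count and D the number of training tokens.

def training_flops(params, tokens):
    """Approximate training compute for a dense transformer (6 * N * D)."""
    return 6.0 * params * tokens

def inference_flops_per_token(params):
    """Approximate forward-pass compute per processed token (2 * N)."""
    return 2.0 * params

# The two designs discussed in the text: (parameters, training tokens).
designs = {
    "280B @ 5.6T": (280e9, 5.6e12),
    "8B @ 15T": (8e9, 15e12),
}

for name, (n, d) in designs.items():
    print(f"{name:<12} tokens/param {d / n:>6.0f}  "
          f"train FLOPs {training_flops(n, d):.2e}  "
          f"FLOPs/token {inference_flops_per_token(n):.2e}")

big_n, big_d = designs["280B @ 5.6T"]
small_n, small_d = designs["8B @ 15T"]
train_ratio = training_flops(big_n, big_d) / training_flops(small_n, small_d)
infer_ratio = inference_flops_per_token(big_n) / inference_flops_per_token(small_n)
print(f"training ratio (large/small): {train_ratio:.1f}x")
print(f"per-token inference ratio (large/small): {infer_ratio:.0f}x")
```

The per-token ratio is simply the ratio of parameter counts, which is why the gap compounds with every query served.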

Post-Chinchilla synthesis

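Optimize total cost of ownership: training compute plus inference compute over the expected lifetime traffic. For high-traffic deployments, this almost always pushes toward smaller models trained for longer. Llama 3's 8B-at-15T pattern is the canonical industry recipe for this regime.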
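To see how traffic shifts the decision, here is a minimal sketch that finds the served-token volume at which two designs have the same total compute. It rests on several assumptions: the 6 * N * D training and 2 * N per-token inference approximations, a Chinchilla-style design of 70B parameters at 1.4T tokens as the comparison point, and, most importantly, that both designs reach comparable quality, which is taken as given here rather than shown. The helper names are hypothetical.

```python
# Assumptions: training FLOPs ~ 6 * N * D, inference FLOPs ~ 2 * N per served token.
# The two designs are ASSUMED to reach comparable quality; that is not established here.

def total_flops(params, train_tokens, served_tokens):
    """Training plus lifetime inference compute, in FLOPs."""
    return 6.0 * params * train_tokens + 2.0 * params * served_tokens

def break_even_served_tokens(design_a, design_b):
    """Served-token volume where both designs cost the same, or None if they never cross."""
    (n_a, d_a), (n_b, d_b) = design_a, design_b
    train_gap = 6.0 * (n_b * d_b - n_a * d_a)   # extra training FLOPs of b over a
    per_token_gap = 2.0 * (n_a - n_b)           # extra FLOPs per served token of a over b
    if per_token_gap == 0:
        return None
    crossover = train_gap / per_token_gap
    return crossover if crossover > 0 else None

chinchilla_style = (70e9, 1.4e12)   # about 20 tokens per parameter
over_trained = (8e9, 15e12)         # the 8B-at-15T pattern
TOKENS_PER_QUERY = 500              # same default as the tco() example

crossover = break_even_served_tokens(chinchilla_style, over_trained)
print(f"break-even: {crossover:.2e} served tokens "
      f"(~{crossover / TOKENS_PER_QUERY:.2e} queries)")

for served in (1e11, 1e12, 1e13):
    big = total_flops(*chinchilla_style, served)
    small = total_flops(*over_trained, served)
    winner = "8B over-trained" if small < big else "70B Chinchilla-style"
    print(f"{served:.0e} served tokens: 70B {big:.2e}, 8B {small:.2e} -> {winner}")
```

Below the break-even volume the larger model's cheaper training run wins; above it, the smaller model's lower per-token cost takes over and the gap keeps widening with traffic.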

Code

TCO comparison: smaller over-trained vs Chinchilla-optimal (Python)

def tco(params_b, training_tokens_t,
        queries_lifetime, tokens_per_query=500):
    """Toy total-cost-of-ownership surrogate in arbitrary cost units.

    params_b is in billions of parameters, training_tokens_t in trillions of tokens.
    Returns (total, training cost, inference cost).
    """
    # rough surrogate: training cost ~ params × tokens × constant
    # rough surrogate: per-query cost ~ params × tokens_per_query × constant
    train_cost = params_b * training_tokens_t * 1e-3   # made-up constant
    infer_cost = params_b * queries_lifetime * tokens_per_query * 1e-9  # made-up constant
    return train_cost + infer_cost, train_cost, infer_cost

# Two designs serving 1B queries:
small_overtr = tco(8, 15, 1_000_000_000)
large_chin   = tco(280, 5.6, 1_000_000_000)
# Two decimals for the training term so small values do not round to 0.
print(f"Small over-trained: total {small_overtr[0]:>10.0f}  (train {small_overtr[1]:.2f}, inf {small_overtr[2]:.0f})")
print(f"Large Chinchilla:   total {large_chin[0]:>10.0f}  (train {large_chin[1]:.2f}, inf {large_chin[2]:.0f})")
# The constants are made up, so only the *shape* is meaningful: per-query cost
# scales linearly with parameter count (280 / 8 = 35x), and with these constants
# inference dominates the total once you're at a billion queries.
