Test-time compute scaling: what "thinking longer" buys you

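Core mechanism

Test-time compute scaling (TTS) is the practice of spending more inference compute, meaning more generated intermediate tokens, on harder problems. Same model weights, same training, but the model can spend a variable amount of compute per request. A trivial question gets a direct answer. A hard one gets thousands of internal reasoning tokens before the final response.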

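Why it works

Each additional reasoning token gives the model one more step to update its hidden state, refine intermediate beliefs, and self-correct. On problems whose answer is not already cached in the weights (math, multi-step coding, planning), more thinking tokens empirically yield measurably better answers. On problems that are essentially retrieval ("What is the capital of France?"), extra tokens are wasted compute.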

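The economic shape

TTS converts inference budget into capability. Pretraining converts compute into weights that already know the answer; TTS converts compute into a thought process that works the answer out. It is a different kind of capability acquisition with a different cost curve: pretraining compute is paid once up front, while TTS compute is paid again on every request that uses it.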

Empirical evidence

OpenAI's o-series, DeepSeek-R1, and Anthropic's extended thinking all show a similar pattern: doubling the thinking-token budget yields meaningful gains on hard reasoning benchmarks (FrontierMath, ARC-AGI), with diminishing returns. Below some threshold the model fails outright; above a saturation point, extra thinking barely helps. In between, the curve is roughly logarithmic in the budget.
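
To make the shape concrete, here is a toy curve with a failure threshold, a logarithmic middle, and a saturation plateau. Everything in it is an assumption chosen for illustration: the function name `toy_accuracy`, the threshold of 500 tokens, the saturation point of 16000 tokens, and the 0.8 ceiling are not measurements from any model.

```python
import math


def toy_accuracy(budget_tokens, *, threshold=500, saturation=16000, ceiling=0.8):
    """Illustrative accuracy-vs-budget curve (made-up parameters, not real data).

    Zero below the threshold, flat at the ceiling above saturation,
    and linear in log(budget) in between.
    """
    if budget_tokens < threshold:
        return 0.0
    if budget_tokens >= saturation:
        return ceiling
    # Fraction of the way from threshold to saturation, measured on a log scale.
    frac = math.log(budget_tokens / threshold) / math.log(saturation / threshold)
    return ceiling * frac


# Print the curve at successive doublings of the budget.
for budget in (250, 500, 1000, 2000, 4000, 8000, 16000, 32000):
    print(f"{budget:>6} tokens -> {toy_accuracy(budget):.2f}")
```

Between the threshold and the saturation point, every doubling of the budget adds the same absolute gain, which is what "roughly logarithmic" means in practice: the tenth thousand tokens buy far less than the first thousand.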

Control surface: the thinking budget

Modern reasoning APIs expose explicit budget controls: Claude's budget_tokens, Gemini's thinking budget, OpenAI's reasoning effort levels. The developer decides how much TTS to spend per request. This is the practical knob most production teams care about: latency versus quality is now an explicit dial rather than a fixed property of the model.
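
A back-of-the-envelope estimate shows what turning that dial costs. The sketch below assumes thinking tokens are billed and decoded like ordinary output tokens and ignores input tokens entirely; the `Pricing` class, the `estimate_request` function, and the rates in the usage lines are hypothetical, so plug in your provider's real numbers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    usd_per_million_output_tokens: float  # hypothetical rate, not a real price
    output_tokens_per_second: float       # hypothetical decode speed


def estimate_request(thinking_tokens, answer_tokens, pricing):
    """Return (cost_usd, latency_seconds) for one request.

    Assumes thinking tokens cost and decode the same as answer tokens.
    """
    total = thinking_tokens + answer_tokens
    cost = total * pricing.usd_per_million_output_tokens / 1_000_000
    latency = total / pricing.output_tokens_per_second
    return cost, latency


# Compare three budget settings under the same made-up pricing.
pricing = Pricing(usd_per_million_output_tokens=10.0, output_tokens_per_second=50.0)
for thinking in (0, 2000, 16000):
    cost, latency = estimate_request(thinking, 500, pricing)
    print(f"thinking={thinking:>5}  cost=${cost:.4f}  latency={latency:.0f}s")
```

Under these assumptions both cost and latency grow linearly with the budget, while quality grows at best logarithmically, which is why the budget is worth tuning per request rather than setting once.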

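What TTS does not do

TTS does not make a fundamentally weak model strong. If the base model lacks the knowledge or skill, more thinking tokens just produce more incorrect text. TTS amplifies what the model already knows; it does not create knowledge from nothing.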

Code

Calling a reasoning model with an explicit thinking budget·python

from anthropic import Anthropic

client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment
response = client.messages.create(
    # Example model ID; substitute one you have access to that supports extended thinking.
    model="claude-sonnet-4-7-1m",
    # max_tokens must be larger than budget_tokens: thinking is spent out of this limit.
    max_tokens=16000,
    thinking={"type": "enabled", "budget_tokens": 8000},
    messages=[{"role": "user", "content": "Prove that sqrt(2) is irrational."}],
)

# The response includes thinking blocks AND the final answer.
for block in response.content:
    if block.type == "thinking":
        print("[thinking]:", block.thinking)
    elif block.type == "text":
        print("[answer]:", block.text)

Adaptive budget strategy (pseudocode)·python

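def call_with_adaptive_budget(prompt, *, hint_difficulty=None):
    # `llm` stands in for your own model-call wrapper; it is not a real library function.
    if hint_difficulty == "trivial":
        return llm(prompt, thinking_budget=0)        # standard mode, no thinking
    if hint_difficulty == "medium":
        return llm(prompt, thinking_budget=2000)
    # "hard" or unknown difficulty falls through to the largest budget.
    return llm(prompt, thinking_budget=16000)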
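
One way the placeholder wrapper could turn a budget into request parameters is sketched below. The helper `build_request_kwargs` and the `answer_tokens` default are hypothetical. The 1024-token minimum and the rule that `max_tokens` must exceed `budget_tokens` reflect Anthropic's documented constraints as best understood here; treat both as assumptions and check the current API reference before relying on them.

```python
MIN_BUDGET = 1024  # assumed provider minimum for a thinking budget; verify in the docs


def build_request_kwargs(thinking_budget, *, answer_tokens=2048):
    """Translate a thinking budget into keyword args for a messages.create-style call.

    A budget of 0 (or less) means standard mode: the thinking parameter is omitted.
    """
    if thinking_budget <= 0:
        return {"max_tokens": answer_tokens}
    # Clamp small budgets up to the assumed minimum.
    budget = max(thinking_budget, MIN_BUDGET)
    return {
        # Leave room for the final answer on top of the thinking budget.
        "max_tokens": budget + answer_tokens,
        "thinking": {"type": "enabled", "budget_tokens": budget},
    }


print(build_request_kwargs(0))
print(build_request_kwargs(2000))
print(build_request_kwargs(16000))
```

Sizing `max_tokens` as budget plus answer room keeps a large thinking budget from starving the final answer.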

Exercise

Pick a reasoning model you have access to. Run the same hard problem (a multi-step math word problem or a logic puzzle) at three different thinking budgets: small, medium, and large. Note how the answer changes (or doesn't) and how cost scales. It is the cheapest way to internalize the TTS curve of the model you actually use.
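
A minimal harness for the exercise might look like the following. It assumes you supply a `solve(prompt, budget)` callable that returns the answer text and the number of output tokens used; `sweep_budgets` and `fake_solve` are hypothetical names, and `fake_solve` is a stand-in so the sketch runs without an API key. Its answers are fabricated for the demo and say nothing about any real model.

```python
import time


def sweep_budgets(solve, prompt, budgets):
    """Run the same prompt at each budget and collect answer, token use, and wall time.

    `solve(prompt, budget)` must return a tuple (answer_text, output_tokens).
    """
    rows = []
    for budget in budgets:
        start = time.perf_counter()
        answer, output_tokens = solve(prompt, budget)
        rows.append({
            "budget": budget,
            "answer": answer,
            "output_tokens": output_tokens,
            "seconds": time.perf_counter() - start,
        })
    return rows


def fake_solve(prompt, budget):
    # Stand-in for a real model call: pretends the problem needs about 2000 thinking tokens.
    answer = "42" if budget >= 2000 else "unsure"
    return answer, budget + 50


for row in sweep_budgets(fake_solve, "A multi-step word problem...", (1024, 4096, 16000)):
    print(row)
```

Swap `fake_solve` for a function that calls your model and reads the token count from the response's usage data, then compare the rows side by side.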