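Emergent abilities and test-time compute

Emergent abilities are capabilities that appear abruptly at a particular scale: large models have them, small models do not, and the transition is sharp rather than gradual.

Classic examples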

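In-context learning. GPT-3 (175B) picked up a task from a few examples in the prompt; GPT-2 (1.5B) could not do so reliably.
Chain-of-thought reasoning. Above a certain scale, asking the model to "think step by step" improves answers on math and reasoning problems dramatically. Below that scale, CoT prompting makes almost no difference.
Code generation from descriptions. Small models can complete code but rarely synthesize a working program from an English spec; large models do so routinely.
Multilingual transfer. Even when trained mostly on English, a model works well across languages once its parameter count passes some threshold.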
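
The first two examples are prompting patterns, so they can be sketched as plain string builders. This is a minimal illustration, not any library's API: the function names `few_shot_prompt` and `cot_prompt` and the "Input:/Output:" layout are assumptions chosen for the example.

```python
from typing import Sequence, Tuple


def few_shot_prompt(examples: Sequence[Tuple[str, str]], query: str) -> str:
    """Build an in-context learning prompt: worked examples, then the new input."""
    blocks = [f"Input: {x}\nOutput: {y}" for x, y in examples]
    # The final block leaves "Output:" open for the model to complete.
    blocks.append(f"Input: {query}\nOutput:")
    return "\n\n".join(blocks)


def cot_prompt(question: str) -> str:
    """Append a step-by-step instruction (zero-shot chain-of-thought style)."""
    return f"{question}\nLet's think step by step."


if __name__ == "__main__":
    print(few_shot_prompt([("cat", "chat"), ("dog", "chien")], "bird"))
    print()
    print(cot_prompt("A train travels 60 km in 45 minutes. What is its speed in km/h?"))
```

The prompts are identical for a small and a large model; the emergence claim is about how differently the two respond to them.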

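Test-time compute scaling

A more recent development: models such as OpenAI o1, DeepSeek-R1, and the Gemini 2.5 Pro thinking mode can allocate more inference compute per query, through visible chain-of-thought, multi-attempt reasoning, and internal verification. A fixed model can improve on hard tasks by spending more compute at inference time, which breaks the assumption that model quality is determined by parameter count alone.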
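
One simple form of multi-attempt reasoning is to sample several answers and keep the most common one. The sketch below simulates this with a toy solver. `noisy_solver` is a hypothetical stand-in for a sampled model answer, and the 60% per-sample accuracy is an invented number for illustration, not a measured result.

```python
import random
from collections import Counter
from typing import Callable

CORRECT = 42  # the toy task's true answer


def noisy_solver(rng: random.Random, p_correct: float = 0.6) -> int:
    """Hypothetical single sample: correct with probability p_correct."""
    if rng.random() < p_correct:
        return CORRECT
    return rng.choice([40, 41, 43, 44])


def majority_vote(sample: Callable[[], int], n_samples: int) -> int:
    """Draw n_samples answers and return the most frequent one."""
    votes = Counter(sample() for _ in range(n_samples))
    return votes.most_common(1)[0][0]


def accuracy(n_samples: int, trials: int = 2000, seed: int = 0) -> float:
    """Fraction of trials in which the voted answer is correct."""
    rng = random.Random(seed)
    hits = sum(
        majority_vote(lambda: noisy_solver(rng), n_samples) == CORRECT
        for _ in range(trials)
    )
    return hits / trials


if __name__ == "__main__":
    for n in (1, 3, 5, 15):
        print(f"samples per query: {n:2d}  accuracy: {accuracy(n):.3f}")
```

The solver never changes; only the number of samples per query does. That is the sense in which extra inference compute buys accuracy without extra parameters.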

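Recent work questions whether some "emergent" abilities are genuinely abrupt or an artifact of evaluation: a discontinuity in the metric rather than in the underlying capability. The cleaner conclusion is that model capability is a function of scale × data × post-training × test-time compute, and the four interact with one another.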
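
The metric-artifact argument can be illustrated with a little arithmetic. Suppose, as a simplifying assumption, that each answer token is correct independently with the same probability. Then exact match on a 20-token answer is that probability raised to the 20th power, which stays near zero for a long time and then climbs steeply even though per-token accuracy improves smoothly. The function name and the chosen accuracies are illustrative only.

```python
def exact_match(per_token_acc: float, answer_len: int) -> float:
    """Probability that all answer_len tokens are correct, assuming independent errors."""
    return per_token_acc ** answer_len


if __name__ == "__main__":
    # Smooth improvement in per-token accuracy, scored with an all-or-nothing metric.
    for acc in (0.5, 0.6, 0.7, 0.8, 0.9, 0.95, 0.99):
        print(f"per-token {acc:.2f} -> exact match on 20 tokens {exact_match(acc, 20):.4f}")
```

Scored per token, the same hypothetical model family would show a gradual curve; scored by exact match, it looks like a sudden jump.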

Code

Test-time compute knob (Anthropic API)·python

import anthropic

client = anthropic.Anthropic()

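# Allocate up to 16K thinking tokens before the answer.
# budget_tokens must stay below max_tokens. Larger max_tokens values
# generally require streaming, and output above the model's default
# cap needs a beta header, so this example stays well under both.
response = client.messages.create(
    model="claude-3-7-sonnet-20250219",
    max_tokens=20_000,
    thinking={"type": "enabled", "budget_tokens": 16_000},
    messages=[{"role": "user", "content":
        "Find x such that 3^x + 5^x = 8^x."}]
)

# The reply holds "thinking" blocks followed by "text" blocks; print the answer.
for block in response.content:
    if block.type == "text":
        print(block.text)

# The same model can give noticeably better answers on hard math
# when given more thinking budget. Tune budget_tokens per task.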
