GSM8K and Math Reasoning

How word problems measure reasoning

GSM8K (Grade School Math 8K) is a set of about 8,500 grade-school math word problems, each needing 2 to 8 reasoning steps. OpenAI released it in 2021, and it became the canonical "can this model reason?" benchmark.

What GSM8K tests

Multi-step arithmetic expressed in natural language.
Reading comprehension under quantitative constraints.
Robustness to irrelevant information ("Tom buys 3 apples and goes to the park...").
Chain-of-thought reasoning, historically the main source of gains.

The chain-of-thought story

GSM8K is where the chain-of-thought (CoT) trick was popularized. A model that scored 17% accuracy when answering directly scored 50%+ when prompted to think step by step before answering. CoT is now baked into most modern models through training, so an explicit CoT prompt matters less than it used to.
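To make the contrast between direct answering and CoT prompting concrete, here is a minimal sketch of the two prompt styles. The helper names `direct_prompt` and `cot_prompt` are hypothetical, and the wording is a zero-shot illustration rather than the few-shot exemplar prompts typically used in benchmark runs.

```python
# Illustrative only: two ways to pose the same GSM8K-style question.
QUESTION = (
    "Natalia sold clips to 48 of her friends in April, and then she sold half "
    "as many clips in May. How many clips did Natalia sell altogether in April "
    "and May?"
)


def direct_prompt(question: str) -> str:
    """Ask for the answer only, with no intermediate reasoning."""
    return f"Question: {question}\nAnswer with a single number.\nAnswer:"


def cot_prompt(question: str) -> str:
    """Ask the model to reason step by step before committing to an answer."""
    return (
        f"Question: {question}\n"
        "Let's think step by step, then give the final answer on its own line "
        "in the form '#### <number>'.\n"
        "Answer:"
    )


if __name__ == "__main__":
    print(direct_prompt(QUESTION))
    print()
    print(cot_prompt(QUESTION))
```

Asking for a fixed final-answer line in the CoT prompt keeps scoring simple, because the reasoning text can be ignored and only the last number compared.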

Principle: GSM8K measures multi-step reasoning. If your product needs reasoning, this benchmark is more relevant than MMLU.

Saturation and follow-ups

Frontier models now score ~90%+ on GSM8K. The community has moved on to:

MATH: competition-level math problems, much harder than grade school.
AIME / HMMT: real high-school math competitions.
GSM-Hard: an adversarial GSM8K variant with larger numbers and trickier wording.
FrontierMath (2024): problems at the level of research mathematicians, on which frontier models scored under 5% at release.

Code

Sample GSM8K problem·text

Question:
Natalia sold clips to 48 of her friends in April, and then she sold half
as many clips in May. How many clips did Natalia sell altogether in April
and May?

Answer (chain-of-thought):
Natalia sold 48 clips in April.
In May, she sold half as many: 48 / 2 = 24 clips.
Total: 48 + 24 = 72 clips.

#### 72

# The harness extracts the final number after `####` for scoring.
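The extraction step mentioned above can be sketched in a few lines. This is an assumption-laden illustration, not the evaluation harness's actual code: the regex and the helper names `extract_final_answer` and `is_correct` are hypothetical, and the model output is assumed to end with the same `#### <number>` marker as the reference.

```python
import re
from typing import Optional

# Matches a number after the GSM8K final-answer marker, e.g. "#### 72".
# Allows an optional sign, thousands separators, and a decimal part.
FINAL_ANSWER = re.compile(r"####\s*(-?[\d,]*\.?\d+)")


def extract_final_answer(text: str) -> Optional[str]:
    """Return the number after the last '####' marker, or None if absent."""
    matches = FINAL_ANSWER.findall(text)
    if not matches:
        return None
    # Drop thousands separators so "1,234" and "1234" compare equal.
    return matches[-1].replace(",", "")


def is_correct(prediction: str, reference: str) -> bool:
    """Compare final answers numerically; a missing marker counts as wrong."""
    pred = extract_final_answer(prediction)
    gold = extract_final_answer(reference)
    if pred is None or gold is None:
        return False
    return float(pred) == float(gold)


if __name__ == "__main__":
    reference = "Total: 48 + 24 = 72 clips.\n\n#### 72"
    print(is_correct("48 / 2 = 24, 48 + 24 = 72\n#### 72", reference))
```

Scoring only the final number means a model gets no credit for correct reasoning with a wrong last step, and full credit for a right answer reached by flawed reasoning.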

Running GSM8K with chain-of-thought·bash

# gpt-4o-mini is a chat model, so use the chat-completions backend.
lm_eval --model openai-chat-completions \
  --model_args model=gpt-4o-mini \
  --tasks gsm8k_cot \
  --num_fewshot 8 \
  --apply_chat_template \
  --output_path results/gsm8k.json

# 8-shot CoT prompting is the canonical setting.
# Frontier models score 90%+ on this, so it is mostly a sanity check
# rather than a discriminator anymore.
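One way to see why a saturated benchmark stops discriminating is to compare score gaps with sampling noise. The sketch below assumes the standard GSM8K test split of 1,319 problems and treats each problem as an independent pass/fail trial; the function names are hypothetical, and treating two runs as independent is a simplification, since both models see the same problems.

```python
import math

GSM8K_TEST_SIZE = 1319  # number of problems in the GSM8K test split


def accuracy_standard_error(accuracy: float, n: int = GSM8K_TEST_SIZE) -> float:
    """Binomial standard error of an accuracy measured on n problems."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n)


def gap_in_standard_errors(
    acc_a: float, acc_b: float, n: int = GSM8K_TEST_SIZE
) -> float:
    """Accuracy gap divided by the standard error of the difference.

    Assumes the two accuracy estimates are independent (a simplification).
    """
    se_diff = math.sqrt(
        accuracy_standard_error(acc_a, n) ** 2
        + accuracy_standard_error(acc_b, n) ** 2
    )
    return (acc_a - acc_b) / se_diff


if __name__ == "__main__":
    print(f"SE at 95% accuracy: {accuracy_standard_error(0.95):.4f}")
    print(f"95% vs 94% gap in SEs: {gap_in_standard_errors(0.95, 0.94):.2f}")
```

At 95% accuracy the standard error is roughly 0.6 percentage points, so under these assumptions a one-point gap between two frontier models sits close to the noise level.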
