MMLU: Massive Multitask Language Understanding

The world's most cited LLM benchmark

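MMLU (Massive Multitask Language Understanding), published by Hendrycks et al. in 2020, is a multiple-choice benchmark covering 57 academic subjects, from elementary mathematics to professional law. Each item is a question with four options, and the model picks one. The score is plain accuracy.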
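Because scoring is plain accuracy, it fits in a few lines. The following is a minimal sketch rather than any harness's actual implementation; the function name `accuracy` and the toy data are assumptions made for illustration.

```python
from typing import Sequence


def accuracy(predictions: Sequence[str], answers: Sequence[str]) -> float:
    """Fraction of items where the predicted letter matches the answer key."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("need at least one item")
    # Compare case-insensitively so "a" and "A" count as the same choice.
    correct = sum(
        p.strip().upper() == a.strip().upper()
        for p, a in zip(predictions, answers)
    )
    return correct / len(answers)


# Hypothetical toy data: four items, three answered correctly.
preds = ["D", "a", "C", "B"]
gold = ["D", "A", "B", "B"]
print(accuracy(preds, gold))  # 0.75
```

With four options per question, random guessing lands near 0.25, so that is the floor to compare against.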

Why MMLU mattered

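Broad coverage: STEM, humanities, social sciences, business, and professional exams. This makes it hard to inflate the score by memorizing a single domain.
Multiple-choice format: scoring is deterministic and the benchmark is easy to run.
Adopted as a default reporting metric by every major model lab.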

What MMLU does not tell you

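It does not measure reasoning. The questions test recall and comprehension and rarely require multi-step reasoning.
It does not measure long-context understanding. Each question is only a few sentences long.
It does not measure agentic capability, tool use, or planning.
It does not predict product quality. A model scoring 85% on MMLU and one scoring 75% can perform identically in a specific RAG pipeline.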

Principle: MMLU is the SAT of LLMs. It is useful as a coarse capability sort and useless as a quality predictor for a specific product.

The contamination problem

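MMLU was published in 2020. Every major model trained since then has very likely seen it, or close paraphrases of it, during training. A high MMLU score may therefore reflect memorization rather than understanding. The community responded with MMLU-Pro (harder questions, more options) and similar harder follow-ups, but the original is still widely cited.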
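One crude way to look for verbatim leakage is to measure how many of a benchmark item's word n-grams also appear in a training document. The sketch below is a toy heuristic under stated assumptions, not the method any lab or paper uses: the helper names, the whitespace tokenization, and the choice of `n` are all hypothetical, and it will not catch the paraphrases mentioned above.

```python
def ngrams(text: str, n: int) -> set:
    """Set of lowercase whitespace-token n-grams in the text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(benchmark_item: str, corpus_doc: str, n: int = 8) -> float:
    """Share of the item's n-grams that also occur in the document."""
    item_grams = ngrams(benchmark_item, n)
    if not item_grams:
        return 0.0  # item shorter than n tokens: nothing to compare
    return len(item_grams & ngrams(corpus_doc, n)) / len(item_grams)


question = "Which of the following structures is part of the small intestine?"
# Hypothetical documents standing in for scraped training data.
leaked = "Quiz dump: which of the following structures is part of the small intestine? A) Ascending colon"
clean = "The jejunum is the middle section of the small intestine."

print(overlap_ratio(question, leaked, n=5))  # 1.0
print(overlap_ratio(question, clean, n=5))   # 0.0
```

A ratio near 1.0 suggests the item appears verbatim; a ratio near 0.0 says little, since a reworded copy would also score low.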

MMLU-Pro and follow-ups

MMLU-Pro (Wang et al., 2024) increases the number of options from 4 to 10, demands more multi-step reasoning, and reduces the dataset's overlap with public training data. For comparing modern models in 2025-2026, it is the more honest successor.
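Going from 4 to 10 options changes what a raw score means, because the random-guess floor drops. The sketch below shows that arithmetic, assuming every question carries the full number of options. The `chance_adjusted` rescaling is an illustration added here, not part of either benchmark's official scoring.

```python
def chance_accuracy(num_options: int) -> float:
    """Expected accuracy of uniform random guessing."""
    if num_options < 2:
        raise ValueError("need at least two options")
    return 1.0 / num_options


def chance_adjusted(acc: float, num_options: int) -> float:
    """Rescale accuracy so random guessing maps to 0 and a perfect score to 1."""
    chance = chance_accuracy(num_options)
    return (acc - chance) / (1.0 - chance)


print(chance_accuracy(4))   # 0.25 (MMLU-style, four options)
print(chance_accuracy(10))  # 0.1  (MMLU-Pro-style, ten options)

# The same raw accuracy sits further above chance with ten options.
print(chance_adjusted(0.55, 4))
print(chance_adjusted(0.55, 10))
```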

Code

Running MMLU with lm-eval-harness·bash

lm_eval --model hf \
  --model_args pretrained=Qwen/Qwen2.5-7B-Instruct \
  --tasks mmlu \
  --num_fewshot 5 \
  --batch_size auto \
  --output_path results/mmlu.json

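# Reports per-subject accuracy plus aggregate scores.
# The headline number is an average over the 57 subjects. Check whether your
# harness version weights subjects equally (macro-average) or by question
# count (micro-average); the two can differ.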
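How per-subject results roll up into one number depends on the averaging scheme. The sketch below contrasts the two common choices; the function names and the counts are hypothetical, chosen only to make the gap visible, and it does not reproduce any harness's internal aggregation code.

```python
from typing import Dict, Tuple

# subject -> (num_correct, num_questions)
SubjectResults = Dict[str, Tuple[int, int]]


def macro_average(per_subject: SubjectResults) -> float:
    """Mean of per-subject accuracies; every subject counts equally."""
    accs = [correct / total for correct, total in per_subject.values()]
    return sum(accs) / len(accs)


def micro_average(per_subject: SubjectResults) -> float:
    """Accuracy over all questions pooled; large subjects count more."""
    total_correct = sum(correct for correct, _ in per_subject.values())
    total_questions = sum(total for _, total in per_subject.values())
    return total_correct / total_questions


# Hypothetical counts: one small easy subject, one large hard subject.
results = {"anatomy": (90, 100), "professional_law": (500, 1000)}
print(macro_average(results))
print(micro_average(results))
```

Here the macro-average is about 0.70 while the micro-average is about 0.54, so a reported MMLU score should state which one it is.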

Sample MMLU question (anatomy)·text

Subject: anatomy

Which of the following structures is part of the small intestine?
A) Ascending colon
B) Cecum
C) Sigmoid colon
D) Jejunum

Correct: D

# Multiple-choice format. Model picks the letter. Scored on accuracy.
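To make the format concrete, the sketch below builds a prompt in the layout of the sample and parses a letter out of a text reply. The layout, the trailing "Answer:" cue, and both function names are assumptions for illustration. Evaluation harnesses may instead compare the likelihood the model assigns to each option rather than parsing generated text, and this parser only handles replies that start with the letter.

```python
import re
from typing import Optional, Sequence

LETTERS = "ABCD"


def format_question(subject: str, question: str, options: Sequence[str]) -> str:
    """Render one four-option item as a plain-text prompt."""
    if len(options) != len(LETTERS):
        raise ValueError("expected exactly four options")
    lines = [f"Subject: {subject}", "", question]
    lines += [f"{letter}) {opt}" for letter, opt in zip(LETTERS, options)]
    lines.append("Answer:")
    return "\n".join(lines)


def extract_letter(model_output: str) -> Optional[str]:
    """Return the leading answer letter, e.g. 'D', 'D)', or '(D)'; else None."""
    match = re.match(r"\s*\(?([ABCD])\b", model_output)
    return match.group(1) if match else None


prompt = format_question(
    "anatomy",
    "Which of the following structures is part of the small intestine?",
    ["Ascending colon", "Cecum", "Sigmoid colon", "Jejunum"],
)
print(prompt)
print(extract_letter("D) Jejunum"))  # D
print(extract_letter("Duodenum"))    # None
```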
