lm-evaluation-harness (EleutherAI)

The standard for academic-style benchmarking

lm-evaluation-harness (EleutherAI, open-source) is the de-facto framework for running standard NLP benchmarks against language models. If you have read "model X scored Y on benchmark Z", that number was very likely produced by lm-eval-harness or one of its forks.

What it covers

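200+ benchmark tasks out of the box: MMLU, HellaSwag, ARC, GSM8K, HumanEval, TruthfulQA, BIG-Bench, and more.
Provider-agnostic: runs on Hugging Face models, vLLM, the OpenAI / Anthropic APIs, local llama.cpp, and MLX.
Reproducibility: exact prompts, exact metric definitions, exact post-processing. Numbers from lm-eval-harness are comparable across teams, as long as they use the same harness version, task configuration, and few-shot settings.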
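
To see why "exact post-processing" matters for comparability, here is a minimal Python sketch of an exact-match metric scored with and without a normalization step. The `normalize` and `exact_match` functions and the sample data are made up for illustration; they are not the harness's own implementation.

```python
import re
import string


def normalize(text: str) -> str:
    """Lowercase, strip ASCII punctuation, and collapse whitespace."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()


def exact_match(predictions, references, post_process=None):
    """Mean exact-match score, optionally after a post-processing step."""
    if post_process is not None:
        predictions = [post_process(p) for p in predictions]
        references = [post_process(r) for r in references]
    hits = [float(p == r) for p, r in zip(predictions, references)]
    return sum(hits) / len(hits)


# Invented model outputs and gold answers, for illustration only.
predictions = ["Paris.", " 42", "blue whale", "H2O"]
references = ["paris", "42", "Blue Whale", "water"]

strict = exact_match(predictions, references)
lenient = exact_match(predictions, references, post_process=normalize)
print(f"strict={strict:.2f} lenient={lenient:.2f}")  # strict=0.00 lenient=0.75
```

Same model outputs, same gold answers, and the score moves from 0.00 to 0.75 depending only on post-processing. Two teams that normalize differently are not reporting the same metric, which is why a shared, pinned definition matters.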

What it does not cover

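It is not a product-eval framework. It will not help you measure whether a specific RAG pipeline answers a specific user's questions. It tells you whether the underlying model has certain general capabilities. Use it to compare base models, or to verify that fine-tuning has not degraded general capabilities. Do not use it to decide whether a feature is ready to ship.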

Rule of thumb: lm-eval-harness compares models; promptfoo / DeepEval / Braintrust compare systems. Both have their place, and neither replaces the other.
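
The "did fine-tuning break general capability" check can be sketched as a comparison of two results files. This is a rough Python sketch with assumptions: the `results -> task -> "metric,filter"` layout mirrors what lm-eval 0.4.x writes, but verify the key names against your own output. The `find_regressions` helper, the 0.01 threshold, and all scores are hypothetical.

```python
def find_regressions(base, tuned, metrics, max_drop=0.01):
    """Return {task: (base_score, tuned_score)} for tasks whose score
    dropped by more than max_drop after fine-tuning."""
    regressions = {}
    for task, metric in metrics.items():
        before = base["results"][task][metric]
        after = tuned["results"][task][metric]
        if before - after > max_drop:
            regressions[task] = (before, after)
    return regressions


# Invented scores, shaped like a loaded results JSON (assumed layout).
base_results = {"results": {
    "mmlu": {"acc,none": 0.62},
    "hellaswag": {"acc_norm,none": 0.71},
    "gsm8k": {"exact_match,strict-match": 0.40},
}}
tuned_results = {"results": {
    "mmlu": {"acc,none": 0.615},
    "hellaswag": {"acc_norm,none": 0.66},
    "gsm8k": {"exact_match,strict-match": 0.45},
}}

# Which metric to read for each task (assumed key names).
metrics = {
    "mmlu": "acc,none",
    "hellaswag": "acc_norm,none",
    "gsm8k": "exact_match,strict-match",
}

for task, (before, after) in find_regressions(base_results, tuned_results, metrics).items():
    print(f"{task}: {before:.3f} -> {after:.3f}")
```

With real runs you would load both dictionaries with `json.load` from the files the harness wrote. A fixed threshold is crude; the reported standard errors are a better guide to whether a drop is noise.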

Code

Running MMLU on a Hugging Face model·bash

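# pip install "lm-eval[hf]"   (quoted so shells like zsh do not expand the brackets)
# The Llama checkpoint is gated: accept its license on Hugging Face and log in first.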
lm_eval --model hf \
  --model_args pretrained=meta-llama/Llama-3.2-3B-Instruct \
  --tasks mmlu \
  --device cuda \
  --batch_size 8 \
  --output_path results/llama-3.2-3b-mmlu.json

Several benchmarks at once·bash

# --num_fewshot applies the same shot count to every listed task, overriding each task's default.
lm_eval --model hf \
  --model_args pretrained=Qwen/Qwen2.5-7B-Instruct \
  --tasks mmlu,gsm8k,hellaswag,truthfulqa_mc2 \
  --num_fewshot 5 \
  --batch_size auto \
  --output_path results/qwen2.5-7b-bench.json

Custom task via YAML·yaml

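# tasks/my_task.yaml
task: my_qa_task
dataset_path: json            # Hugging Face `datasets` JSON loader
dataset_name: null
dataset_kwargs:
  data_files: ./data/my_qa.jsonl
test_split: train             # a single local file is loaded as the "train" split
output_type: generate_until
# Double quotes so YAML turns \n into a real newline (single quotes keep it literal).
doc_to_text: "Question: {{question}}\nAnswer:"
doc_to_target: "{{answer}}"
metric_list:
  - metric: exact_match
    aggregation: mean
    higher_is_better: true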
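
The task config above expects a JSONL file with one object per line whose fields match the `{{question}}` and `{{answer}}` placeholders. A minimal Python sketch for producing such a file follows. The example rows and the `write_jsonl` helper are hypothetical; only the path and field names come from the config.

```python
import json
from pathlib import Path

# Hypothetical examples; the keys must match the {{question}} and
# {{answer}} placeholders used in the task YAML.
examples = [
    {"question": "What is the capital of France?", "answer": "Paris"},
    {"question": "How many legs does a spider have?", "answer": "8"},
]


def write_jsonl(path, rows):
    """Write one JSON object per line, creating parent directories as needed."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row, ensure_ascii=False) + "\n")
    return path


out = write_jsonl("data/my_qa.jsonl", examples)
print(f"wrote {len(examples)} examples to {out}")
```

To run the task, point the harness at the directory holding the YAML, for example `--tasks my_qa_task --include_path tasks/` on the same `lm_eval` command line. Check `lm_eval --help` for your installed version, since flags can change between releases.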
