MTEB, ARC-AGI, and Modern Frontier Benchmarks

Specialized and frontier benchmarks worth knowing

MTEB — Massive Text Embedding Benchmark

MTEB (Muennighoff 2023) is the de facto benchmark for embedding models. It covers 56+ tasks spanning retrieval, classification, clustering, semantic similarity, and reranking. If you are choosing an embedding model for retrieval, MTEB is the first reference to check.

What to read on MTEB:

Retrieval sub-leaderboard: the most relevant for RAG.
Reranking: relevant if you use a cross-encoder reranker.
Multilingual variants (MIRACL, MMTEB): for non-English production workloads.

ARC-AGI (Chollet 2019, ARC-AGI-2 in 2025)

Abstract-pattern visual puzzles: each task provides a few input/output examples and a held-out test input. The benchmark is specifically designed to resist memorization, so every test item is novel. Frontier models scored 5-15% in 2023; in 2025, with extensive test-time compute, they reached 50-87%. It is still regarded as a meaningful frontier challenge for general intelligence.
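The task layout described above can be sketched in a few lines. This is a minimal illustration, not an official harness. It assumes the public ARC convention of a task holding `train` and `test` lists of input/output grids of small integers. The grids, the `mirror_rows` rule, and the helper names `fits_train` and `score` are all hypothetical.

```python
# A toy task in the ARC-style layout: a few train pairs plus a held-out test input.
# The grids and the "mirror each row" rule are made up for illustration.
task = {
    "train": [
        {"input": [[1, 0, 0], [2, 0, 0]], "output": [[0, 0, 1], [0, 0, 2]]},
        {"input": [[0, 3], [4, 0]], "output": [[3, 0], [0, 4]]},
    ],
    "test": [{"input": [[5, 0, 0, 0]], "output": [[0, 0, 0, 5]]}],
}

def mirror_rows(grid):
    """Hypothetical candidate rule: reverse every row of the grid."""
    return [list(reversed(row)) for row in grid]

def identity(grid):
    """Baseline rule that returns the grid unchanged."""
    return grid

def fits_train(rule, task):
    """A rule is only plausible if it reproduces every train output exactly."""
    return all(rule(pair["input"]) == pair["output"] for pair in task["train"])

def score(rule, task):
    """Fraction of test grids predicted exactly; partial matches earn nothing."""
    hits = [rule(pair["input"]) == pair["output"] for pair in task["test"]]
    return sum(hits) / len(hits)

for rule in (mirror_rows, identity):
    print(rule.__name__, fits_train(rule, task), score(rule, task))
```

Exact-match scoring on an unseen grid is what makes memorization unhelpful: a solver has to infer the rule from the handful of train pairs alone.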

BIG-Bench Hard

BIG-Bench (Google 2022) is a bundle of 200+ diverse tasks. BIG-Bench Hard is the 23 hardest of them. It is useful for stress-testing a model's range across unfamiliar problem types.

FrontierMath

Research-mathematician-level problems. Designed to make memorization impossible (the problems are written by professional mathematicians and kept secret). As of 2025, frontier models score <10%, making it one of the few unsaturated reasoning benchmarks.

Principle: when a public benchmark saturates, the field invents a harder one. The "right" benchmark for evaluation in 2026 is a frontier-relevant one that has not been cracked yet.
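One way to apply this principle is to filter candidate benchmarks by remaining headroom. The sketch below is only an illustration. The 90% saturation threshold is an assumption, the ARC-AGI and FrontierMath values are the upper bounds of the figures quoted above, and the `saturated-example` entry and the `unsaturated` helper are hypothetical.

```python
# Illustrative best-reported scores in percent. The ARC-AGI and FrontierMath values are
# upper bounds of the figures quoted in the text; "saturated-example" is a made-up entry.
best_scores = {
    "ARC-AGI": 87.0,
    "FrontierMath": 10.0,
    "saturated-example": 97.0,
}

SATURATION_THRESHOLD = 90.0  # assumption: above this, a benchmark barely separates models

def unsaturated(scores, threshold=SATURATION_THRESHOLD):
    """Return (name, headroom) pairs below the threshold, largest headroom first."""
    headroom = {name: 100.0 - s for name, s in scores.items() if s < threshold}
    return sorted(headroom.items(), key=lambda item: item[1], reverse=True)

for name, points in unsaturated(best_scores):
    print(f"{name}: {points:.0f} points of headroom")
```

The threshold is a judgment call; the point is that a benchmark near its ceiling no longer distinguishes strong models from each other.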

Code

MTEB: choosing an embedding model · python

# pip install mteb
import mteb
from sentence_transformers import SentenceTransformer

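# Candidate embedding model to evaluate; swap in each model you are comparing.
model = SentenceTransformer("intfloat/e5-large-v2")
# Select only the English retrieval tasks, the subset most relevant to RAG.
tasks = mteb.get_tasks(task_types=["Retrieval"], languages=["eng"])
# Run the evaluation; per-task result files are written under results/mteb.
results = mteb.MTEB(tasks=tasks).run(model, output_folder="results/mteb")

# Each retrieval task reports nDCG@10. Averaging them gives a single comparison number.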
# Pick the best model for the retrieval pattern that matches your corpus.
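To make the nDCG@10 number in the comments concrete, here it is computed by hand on a toy ranking. This is a minimal sketch, not MTEB's own implementation. It assumes the common linear-gain form of DCG, and the function names `dcg_at_k` and `ndcg_at_k` and the per-query data are hypothetical.

```python
import math

def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the top-k relevance grades (linear gain)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(ranked_relevances, judged_relevances, k=10):
    """DCG of the system ranking divided by the DCG of the ideal ranking."""
    ideal = dcg_at_k(sorted(judged_relevances, reverse=True), k)
    return dcg_at_k(ranked_relevances, k) / ideal if ideal > 0 else 0.0

# Hypothetical per-query data: "ranked" holds the relevance grade of each retrieved
# document in rank order; "judged" holds the grades of every relevant document.
queries = {
    "q1": {"ranked": [1, 0, 1, 0], "judged": [1, 1]},
    "q2": {"ranked": [0, 0, 1], "judged": [1]},
}

per_query = {qid: ndcg_at_k(q["ranked"], q["judged"]) for qid, q in queries.items()}
mean_ndcg = sum(per_query.values()) / len(per_query)
print(per_query, round(mean_ndcg, 3))
```

A relevant document at rank 3 earns half the credit of one at rank 1, which is why the metric rewards models that put the right passage near the top rather than merely somewhere in the top 10.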
