Chatbot Arena and MT-Bench

Benchmarks that correlate with real user perception

Multiple-choice benchmarks measure narrow capabilities, but real users care about open-ended dialogue. These two benchmarks fill that gap.

Chatbot Arena (LMSYS 2023→ongoing)

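Crowdsourced pairwise comparison: a user sees two anonymous model outputs and votes for the better one, and an Elo ranking emerges from millions of votes. It is the closest thing the field has to a "real user" benchmark, and it is continuously updated at lmarena.ai.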

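Why it works:

Real prompts from real users in the wild: a distribution closer to product reality than any fixed benchmark.
Pairwise comparison reduces the calibration burden on raters: picking the better of two answers is easier than assigning a consistent absolute score.
Elo aggregation is robust to judge diversity.
Continuous updates make contamination much harder.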

Limitations:

Skews toward English and casual chat.
Voters are not domain experts, so subtle factual errors go uncaught.
Some categories (coding, long-context, reasoning) have their own sub-leaderboards, and the aggregate Elo dilutes them.
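
The dilution effect can be sketched with simple arithmetic. The example below assumes a single head-to-head matchup against one fixed opponent, and the traffic shares and win rates are hypothetical numbers chosen only for illustration. It converts each win rate to the Elo gap it implies, which shows how a large lead in a low-traffic category nearly disappears in the aggregate.

```python
import math

def win_rate_to_elo_gap(p):
    """Elo gap implied by a head-to-head win rate p (0 < p < 1)."""
    return 400 * math.log10(p / (1 - p))

# Hypothetical numbers, for illustration only.
traffic_share = {"casual_chat": 0.9, "coding": 0.1}   # share of all votes
win_rate = {"casual_chat": 0.45, "coding": 0.80}      # our model vs. one opponent

# The aggregate win rate is the traffic-weighted mean of per-category win rates.
aggregate = sum(traffic_share[c] * win_rate[c] for c in traffic_share)

for category, p in win_rate.items():
    print(f"{category:12s} win rate {p:.2f} -> gap {win_rate_to_elo_gap(p):+.0f}")
print(f"{'aggregate':12s} win rate {aggregate:.3f} -> gap {win_rate_to_elo_gap(aggregate):+.0f}")
```

If the product is a coding assistant, the per-category number is the one worth reading; the aggregate mostly reflects the casual-chat traffic.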

MT-Bench (Zheng et al. 2023)

80 multi-turn open-ended questions across 8 categories (writing, reasoning, math, coding, extraction, STEM, humanities, role-play). An LLM-as-Judge (usually GPT-4) scores each response from 1 to 10. Largely superseded by the Arena, but still used in papers.
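
A minimal sketch of the scoring side of this setup, not the MT-Bench code itself. It assumes the judge has been prompted to end its reply with a bracketed score such as `Rating: [[7]]` (an assumption about the judge prompt), and the helper names `parse_rating` and `category_scores` as well as the sample replies are hypothetical. The judge call itself is left out; only parsing and per-category averaging are shown.

```python
import re
from collections import defaultdict
from statistics import mean

# Assumed judge output convention: a score wrapped in double brackets.
RATING_RE = re.compile(r"\[\[(\d+(?:\.\d+)?)\]\]")

def parse_rating(judge_output):
    """Return the 1-10 score from a judge reply, or None if missing or out of range."""
    match = RATING_RE.search(judge_output)
    if match is None:
        return None
    score = float(match.group(1))
    return score if 1 <= score <= 10 else None

def category_scores(records):
    """records: iterable of (category, judge_output). Returns mean score per category."""
    buckets = defaultdict(list)
    for category, judge_output in records:
        score = parse_rating(judge_output)
        if score is not None:  # skip replies the judge did not format correctly
            buckets[category].append(score)
    return {category: mean(scores) for category, scores in buckets.items()}

# Hypothetical judge replies, for illustration only.
sample = [
    ("math", "The solution is correct but terse. Rating: [[8]]"),
    ("math", "The second turn contains an arithmetic slip. Rating: [[4]]"),
    ("writing", "Clear and well organized. Rating: [[9]]"),
    ("writing", "I cannot assess this response."),
]
print(category_scores(sample))
```

Dropping unparseable replies silently can bias the average, so in practice it is worth counting how many were skipped per category.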

Principle: Chatbot Arena is the public benchmark closest to "what users actually feel." Use it as the default model-selection signal, then validate with a product-specific eval.

Code

Reading the Arena rankings·text

# Visit https://lmarena.ai/?leaderboard
#
# Filter by category for a more useful read:
#   Overall — general chat (English-skewed)
#   Hard Prompts — adversarial / niche prompts (better discriminator)
#   Coding — code-specific Elo
#   Math — multi-step reasoning Elo
#   Vision — multimodal Elo
#
# A 50-Elo gap is roughly the threshold where users start to notice a difference.
# A 100-Elo gap is a conspicuous quality difference.
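
To make those gaps concrete, the standard Elo expected-score formula turns a rating lead into a head-to-head win probability. This is a small sketch that ignores ties; the function name `expected_win_rate` is illustrative.

```python
def expected_win_rate(elo_gap):
    """Probability that the higher-rated model wins, given its rating lead."""
    return 1 / (1 + 10 ** (-elo_gap / 400))

for gap in (0, 50, 100, 200):
    print(f"{gap:>3} Elo gap -> {expected_win_rate(gap):.1%} expected win rate")
```

A 50-point lead works out to winning a bit under six comparisons in ten, and a 100-point lead to a bit under two in three, which is why small gaps near the top of a leaderboard should be read together with their confidence intervals.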

Building your own pairwise + Elo evaluation·python

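# Toy judgments so the script runs end to end; replace with your own data.
# Each entry is (winner_id, loser_id).
pairwise_judgments = [
    ("model_a", "model_b"),
    ("model_a", "model_c"),
    ("model_b", "model_c"),
    ("model_c", "model_b"),
    ("model_a", "model_b"),
]

def update_elo(winner_rating, loser_rating, k=32):
    """Return (new_winner_rating, new_loser_rating) after one comparison."""
    # Probability that the eventual winner was expected to win.
    expected = 1 / (1 + 10 ** ((loser_rating - winner_rating) / 400))
    delta = k * (1 - expected)  # bigger update for a more surprising result
    return winner_rating + delta, loser_rating - delta

ratings = {"model_a": 1500, "model_b": 1500, "model_c": 1500}

for winner, loser in pairwise_judgments:
    ratings[winner], ratings[loser] = update_elo(ratings[winner], ratings[loser])

print(sorted(ratings.items(), key=lambda item: item[1], reverse=True))

# With enough comparisons the ranking stabilizes, although online Elo depends
# on the order in which judgments arrive. This is the classic Elo update on the
# same rating scale the Arena reports, applied to your own dataset and judges.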
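
Online Elo updates depend on the order of the judgments. An order-independent alternative is to fit a Bradley-Terry model over all comparisons at once, which is, as far as its maintainers have described, closer to how the Arena computes its published ratings. The sketch below uses the classic minorization-maximization (MM) iteration in pure Python. The function names and sample judgments are hypothetical, ties are not handled, and every model is assumed to have at least one win.

```python
import math
from collections import defaultdict

def bradley_terry(judgments, iterations=200):
    """Fit Bradley-Terry strengths from (winner_id, loser_id) pairs via MM updates."""
    wins = defaultdict(int)    # total wins per model
    games = defaultdict(int)   # comparisons per unordered pair of models
    models = set()
    for winner, loser in judgments:
        wins[winner] += 1
        games[frozenset((winner, loser))] += 1
        models.update((winner, loser))
    if any(wins[m] == 0 for m in models):
        raise ValueError("every model needs at least one win for this sketch")

    strength = {m: 1.0 for m in models}
    for _ in range(iterations):
        updated = {}
        for m in models:
            # Sum over opponents of n_games / (own strength + opponent strength).
            denom = sum(
                n / (strength[m] + strength[other])
                for pair, n in games.items() if m in pair
                for other in pair - {m}
            )
            updated[m] = wins[m] / denom
        total = sum(updated.values())
        # Rescale so the mean strength stays at 1 (the scale is arbitrary).
        strength = {m: s * len(models) / total for m, s in updated.items()}
    return strength

def to_elo_scale(strength, anchor=1500):
    """Map strengths to an Elo-like scale where a 10x strength ratio is 400 points."""
    return {m: anchor + 400 * math.log10(s) for m, s in strength.items()}

# Hypothetical judgments: each stronger model beats the weaker one 3 times out of 4.
sample_judgments = (
    [("model_a", "model_b")] * 3 + [("model_b", "model_a")]
    + [("model_b", "model_c")] * 3 + [("model_c", "model_b")]
    + [("model_a", "model_c")] * 3 + [("model_c", "model_a")]
)
print(to_elo_scale(bradley_terry(sample_judgments)))
```

Because the fit only uses win counts per pair, shuffling the judgments gives the same ratings, which makes bootstrap confidence intervals easier to interpret than with online updates.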
