Pairwise comparison: when absolute scoring lies

~14 min · evaluation, pairwise


Humans (and judges) compare better than they score

Asking "Is this output good?" produces a noisy 1-to-5 score; asking "Is A or B better?" produces a tight, reliable comparison. For subjective tasks (writing quality, helpfulness, tone), pairwise wins.

Pairwise eval shape

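Run prompt v_old and prompt v_new on the same N inputs.
For each input, show the judge (human or LLM) the two outputs side by side.
The judge picks A, B, or tie.
Aggregate: v_new wins X% of the time. If X sits above 50% by more than sampling noise on N inputs would explain, the improvement is significant.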
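The aggregation step can be sketched as follows. This is a minimal example under stated assumptions, not a prescribed method: the verdict labels "new", "old", and "tie" and the function name `win_rate_summary` are made up for illustration, ties are dropped before computing the win rate (one convention among several; counting a tie as half a win is another), and an exact two-sided sign test stands in for "above 50% by more than noise".

```python
from math import comb


def win_rate_summary(verdicts):
    """Summarize pairwise verdicts, each one of "new", "old", or "tie".

    Assumption: ties are excluded from the win rate and from the test.
    """
    wins = verdicts.count("new")
    losses = verdicts.count("old")
    ties = verdicts.count("tie")
    decided = wins + losses

    if decided == 0:
        # Nothing but ties: no evidence either way.
        return {"wins": 0, "losses": 0, "ties": ties,
                "win_rate": None, "p_value": 1.0}

    # Exact two-sided sign test against a 50/50 null:
    # probability of a split at least this lopsided, in either direction.
    k = max(wins, losses)
    tail = sum(comb(decided, i) for i in range(k, decided + 1)) / 2 ** decided
    p_value = min(1.0, 2 * tail)

    return {"wins": wins, "losses": losses, "ties": ties,
            "win_rate": wins / decided, "p_value": p_value}


if __name__ == "__main__":
    example = ["new"] * 35 + ["old"] * 10 + ["tie"] * 5
    print(win_rate_summary(example))
```

With 35 wins, 10 losses, and 5 ties, the win rate is 35/45 (about 78%) and the p-value falls well below 0.05. A 5-to-5 split returns a p-value of 1.0, meaning no evidence that either prompt is better.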

Pitfalls

Position bias: the judge favors the first option. Randomize the order.
Length bias: the judge favors the longer output. Add a length-control instruction.
Verbosity bias: wordier output sounds more thoughtful. State the criteria explicitly.
Self-preference: an LLM judge tends to favor outputs from its own model family.
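Order randomization, the fix for position bias, can be sketched like this. The helper names `make_pair` and `unblind` and the "new"/"old" labels are hypothetical; the idea is only that the judge sees slots A and B, while the mapping back to v_new and v_old stays on your side.

```python
import random


def make_pair(out_old, out_new, rng):
    """Randomly place the two outputs in slots A and B.

    Returns the slot contents plus which slot holds v_new's output,
    so the verdict can be mapped back after judging.
    """
    if rng.random() < 0.5:
        return {"out_a": out_new, "out_b": out_old, "new_slot": "A"}
    return {"out_a": out_old, "out_b": out_new, "new_slot": "B"}


def unblind(winner, new_slot):
    """Map the judge's "A" / "B" / "tie" back to "new" / "old" / "tie"."""
    if winner == "tie":
        return "tie"
    return "new" if winner == new_slot else "old"


if __name__ == "__main__":
    rng = random.Random(0)  # seeded so the shuffle is reproducible
    pair = make_pair("old answer", "new answer", rng)
    # Pretend the judge picked slot A; see which prompt that was.
    print(pair["new_slot"], unblind("A", pair["new_slot"]))
```

A stricter variant judges every pair in both orders and counts a win only when the two verdicts agree, at twice the judging cost.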

Code

Pairwise judge prompt (markdown)

## Task
You are comparing two answers (A and B) to the same support question. Pick the one that better satisfies the rubric.

## Rubric
- Accuracy: cites correct policy.
- Clarity: opens with the verdict.
- Tone: warm but terse, no apology preamble.
- Length: ≤ 100 words.

## Output
{"winner": "A" | "B" | "tie", "reason": "<one sentence>"}

## Question
{{q}}

## A
{{out_a}}

## B
{{out_b}}
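The judge prompt ends with the B section above. To use it, the `{{q}}`, `{{out_a}}`, and `{{out_b}}` placeholders have to be filled in and the judge's JSON reply has to be validated. Below is a minimal sketch of both steps; `render` and `parse_verdict` are hypothetical helper names, and the call to the judge model itself is deliberately left out because it depends on your provider.

```python
import json
import re


def render(template, **fields):
    """Fill {{name}} placeholders in a single pass.

    A single pass means placeholder-like text inside an answer is not
    substituted again. A placeholder with no matching field raises KeyError.
    """
    return re.sub(r"\{\{(\w+)\}\}", lambda m: fields[m.group(1)], template)


def parse_verdict(raw):
    """Return (winner, reason), or None if the reply is not a valid verdict."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    winner = data.get("winner")
    if winner not in {"A", "B", "tie"}:
        return None
    return winner, str(data.get("reason", ""))


if __name__ == "__main__":
    prompt = render("## Question\n{{q}}\n\n## A\n{{out_a}}\n\n## B\n{{out_b}}",
                    q="Can I get a refund?", out_a="Yes.", out_b="No.")
    print(prompt)
    print(parse_verdict('{"winner": "A", "reason": "Opens with the verdict."}'))
```

Returning None for malformed replies lets the caller decide whether to retry the judge or drop the comparison, rather than silently counting it as a tie.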

Exercise

Run a pairwise comparison between your current prompt and a candidate revision on 50 inputs. Use an LLM judge with a clear rubric. Randomize the order. Report the win rate.
