C.W.K.
Stream
Lesson 03 of 10 · published

Pairwise comparison: when absolute scoring lies

~14 min · evaluation, pairwise


Humans (and judges) compare better than they score

Asking "Is this output good?" produces a noisy 1-to-5 score; asking "Is A or B better?" produces a tight, reliable comparison. For subjective tasks (writing quality, helpfulness, tone), pairwise wins.

Pairwise eval shape

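  1. Run prompts v_old and v_new on the same N inputs.
  2. For each input, show the judge (human or LLM) the two outputs side by side.
  3. The judge picks A, B, or tie.
  4. Aggregate: v_new wins X% of the time. If X sits above 50% by more than sampling noise at N comparisons would explain, the improvement is significant.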
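Step 4 can be sketched as a small aggregation helper. This is a minimal sketch, not a prescribed method: the function name `win_rate_summary`, the verdict labels "new" / "old" / "tie", and the choice to drop ties and apply a one-sided exact sign test are all assumptions made for illustration.

```python
from collections import Counter
from math import comb


def win_rate_summary(verdicts):
    """Summarize pairwise verdicts, each one of "new", "old", or "tie"."""
    counts = Counter(verdicts)
    wins, losses, ties = counts["new"], counts["old"], counts["tie"]
    decided = wins + losses
    if decided == 0:
        raise ValueError("no decided comparisons (all ties or empty input)")
    # Win rate among decided comparisons; ties are reported but not counted.
    win_rate = wins / decided
    # One-sided exact sign test: probability of seeing at least `wins`
    # successes in `decided` fair coin flips (null hypothesis: no difference).
    p_value = sum(comb(decided, k) for k in range(wins, decided + 1)) / 2**decided
    return {
        "wins": wins,
        "losses": losses,
        "ties": ties,
        "win_rate": win_rate,
        "p_value": p_value,
    }


# Hypothetical verdicts for 50 inputs: 32 wins for v_new, 14 for v_old, 4 ties.
summary = win_rate_summary(["new"] * 32 + ["old"] * 14 + ["tie"] * 4)
print(summary)
```

A small p-value suggests the win rate is unlikely to come from a coin-flip judge; with only a few dozen inputs, a win rate in the mid-50s usually will not clear that bar.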

Pitfalls

  • Position bias: judges favor the first option. Randomize the order.
  • Length bias: judges favor the longer output. Add a length-control instruction.
  • Verbosity bias: wordier output sounds more thoughtful. State the criteria explicitly.
  • Self-preference: an LLM judge tends to favor outputs from its own model family.
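The order randomization mentioned for position bias might look like the sketch below. The helper names `present_pair` and `resolve_winner` are hypothetical; the one idea that matters is to keep the slot-to-version mapping so the judge's "A" or "B" can be translated back afterwards.

```python
import random


def present_pair(out_old, out_new, rng):
    """Randomly assign the two outputs to slots A and B.

    Returns (out_a, out_b, mapping), where mapping says which prompt
    version ended up in which slot.
    """
    if rng.random() < 0.5:
        return out_old, out_new, {"A": "old", "B": "new"}
    return out_new, out_old, {"A": "new", "B": "old"}


def resolve_winner(judge_winner, mapping):
    """Translate the judge's "A" / "B" / "tie" back to "old" / "new" / "tie"."""
    return "tie" if judge_winner == "tie" else mapping[judge_winner]


rng = random.Random(0)  # fixed seed so the slot assignment is reproducible
out_a, out_b, mapping = present_pair("old answer", "new answer", rng)
print(mapping, resolve_winner("A", mapping))
```

Storing the mapping next to each verdict also makes it possible to check later whether slot A won more often than chance, which would point to residual position bias.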

Code

Pairwise judge prompt·markdown
## Task
You are comparing two answers (A and B) to the same support question. Pick the one that better satisfies the rubric.

## Rubric
- Accuracy: cites correct policy.
- Clarity: opens with the verdict.
- Tone: warm but terse, no apology preamble.
- Length: ≤ 100 words.

## Output
{"winner": "A" | "B" | "tie", "reason": "<one sentence>"}

## Question
{{q}}

## A
{{out_a}}

## B
{{out_b}}
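To use the prompt above programmatically, the `{{q}}`, `{{out_a}}`, and `{{out_b}}` placeholders need filling and the judge's JSON reply needs validating. The following is a sketch under assumptions: `render` and `parse_verdict` are hypothetical helper names, `TEMPLATE` is shortened to the placeholder sections only, and the actual call to a judge model is left out because it depends on your provider.

```python
import json
import re

# Shortened stand-in for the full judge prompt; only the placeholder part.
TEMPLATE = "## Question\n{{q}}\n\n## A\n{{out_a}}\n\n## B\n{{out_b}}\n"


def render(template, **fields):
    """Fill {{name}} placeholders in a single pass.

    Single braces (as in the JSON output spec) are left untouched.
    A placeholder with no matching field raises KeyError.
    """
    return re.sub(r"\{\{(\w+)\}\}", lambda m: fields[m.group(1)], template)


def parse_verdict(raw):
    """Extract and validate the {"winner": ..., "reason": ...} object."""
    # Judges sometimes wrap the JSON in extra prose, so grab the outermost braces.
    match = re.search(r"\{.*\}", raw, flags=re.DOTALL)
    if match is None:
        raise ValueError("no JSON object in judge reply")
    data = json.loads(match.group(0))
    winner = data.get("winner")
    if winner not in {"A", "B", "tie"}:
        raise ValueError(f"unexpected winner: {winner!r}")
    return winner, str(data.get("reason", ""))


prompt = render(TEMPLATE, q="Can I get a refund?", out_a="Yes.", out_b="No.")
winner, reason = parse_verdict('Sure: {"winner": "B", "reason": "cites policy"}')
print(winner, reason)
```

Rejecting anything other than "A", "B", or "tie" keeps malformed replies out of the win-rate count instead of silently treating them as a vote.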

Exercise

Run a pairwise comparison between your current prompt and a candidate revision on 50 inputs. Use an LLM judge with a clear rubric. Randomize the order. Report the win rate.
