Lesson 06 of 07 · published

Judge Bias and Calibration

~22 min · judges, bias, calibration


Judges are biased. Measure it.

An LLM judge is an instrument with known systematic errors. Treat it like a thermometer that reads 2 degrees high: usable, but only once you correct for the offset.

The four biases that matter

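  1. Position bias: in pairwise comparisons, the judge prefers the output shown first (or, sometimes, last). Defense: run both orderings and average.
  2. Length bias: the judge rewards longer outputs even when the shorter one is better. Defense: include "Length should not influence the verdict" in the rubric; spot-check with truncated and expanded variants of the same answer.
  3. Self-preference: the judge prefers outputs that look like they came from its own model family. Defense: use a different model family as the judge, or use several judges and look at their agreement.
  4. Verbosity / confidence bias: the judge prefers assertive, polished outputs over hedged or uncertain ones, even when the uncertainty is appropriate. Defense: an explicit rubric clause about appropriate hedging.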
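The "run both orderings and average" defense for position bias can be sketched as follows. This is a minimal illustration, not a fixed API: the judge signature (a callable returning a score in [0, 1] for how strongly it prefers the first output shown) and the names `debiased_preference` and `position_gap` are assumptions made for the example.

```python
from typing import Callable

# Hypothetical judge signature: (prompt, first_output, second_output) -> score
# in [0, 1] for how strongly the judge prefers the FIRST output shown.
PairwiseJudge = Callable[[str, str, str], float]

def debiased_preference(judge: PairwiseJudge, prompt: str, out_a: str, out_b: str) -> dict:
    """Score A vs. B in both orderings and average the two results."""
    a_first = judge(prompt, out_a, out_b)         # preference for A when A is shown first
    a_second = 1.0 - judge(prompt, out_b, out_a)  # preference for A when A is shown second
    return {
        "pref_a": (a_first + a_second) / 2,
        # The gap between orderings is a rough per-case measure of position bias.
        "position_gap": a_first - a_second,
    }
```

A judge that always favors whatever comes first ends up with `pref_a` near 0.5 and a large `position_gap`, which is the pattern to watch for across a sample of cases.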

How to calibrate a judge

  1. Sample 100 cases.
  2. Have humans rate them.
  3. Run the judge on the same 100 cases.
  4. Compute Cohen's kappa or a simple agreement rate.
  5. Look at the disagreements. Update the prompt. Run it again.
  6. Iterate until you reach the level of inter-human agreement.
Principle: a judge that has not been calibrated against humans on your specific task is a guess. Calibration is the difference between an instrument and a story.
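Step 5 is easier when the disagreements are grouped by direction, since "judge too lenient" and "judge too strict" usually call for different prompt changes. A minimal sketch, assuming PASS/FAIL-style labels and a hypothetical helper name `disagreement_report`:

```python
from collections import Counter

def disagreement_report(case_ids, human_labels, judge_labels):
    """Collect the cases where human and judge differ, grouped by direction."""
    if not (len(case_ids) == len(human_labels) == len(judge_labels)):
        raise ValueError("all three lists must have the same length")
    misses = [
        {"id": cid, "human": h, "judge": j}
        for cid, h, j in zip(case_ids, human_labels, judge_labels)
        if h != j
    ]
    # Count each (human, judge) pair, e.g. ("FAIL", "PASS") = judge too lenient.
    directions = Counter((m["human"], m["judge"]) for m in misses)
    return {"misses": misses, "directions": dict(directions)}
```

Reading the `misses` for the most common direction first tends to show which rubric clause is missing or ambiguous.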

Multiple judges → ensemble

For high-stakes evals, run 2-3 judges from different model families and aggregate. Disagreement between judges is a useful signal: those are the cases that need human review. This costs more than a single-judge pipeline but is far more reliable.

Code

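Calibration: measuring agreement with humans·python
from sklearn.metrics import cohen_kappa_score

def calibrate(human_labels, judge_labels):
    """Compare judge labels to human labels on the same cases.

    Both are equal-length lists of 0/1 (or PASS/FAIL).
    """
    if not human_labels or len(human_labels) != len(judge_labels):
        raise ValueError("need two non-empty label lists of equal length")
    # Raw agreement: fraction of cases where judge and human gave the same label.
    agree = sum(h == j for h, j in zip(human_labels, judge_labels)) / len(human_labels)
    # Cohen's kappa: agreement corrected for what chance alone would produce.
    kappa = cohen_kappa_score(human_labels, judge_labels)
    return {"agreement": agree, "kappa": kappa}

# Target: kappa > 0.7. A kappa of 0 is chance-level agreement, so anything
# below 0.4 means your judge is only weakly better than chance for this task.
# Inter-human agreement is the ceiling: judges rarely beat humans
# at agreeing with humans.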
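To see why kappa is reported alongside the simple agreement rate, it helps to compute it by hand once. The sketch below uses the standard definition, kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement and p_e is the agreement expected by chance from each rater's label frequencies. The function name `kappa_by_hand` and the toy labels are illustrative only.

```python
from collections import Counter

def kappa_by_hand(human_labels, judge_labels):
    """Cohen's kappa from its definition: (p_o - p_e) / (1 - p_e)."""
    n = len(human_labels)
    if n == 0 or n != len(judge_labels):
        raise ValueError("need two non-empty label lists of equal length")
    # Observed agreement: fraction of cases with identical labels.
    p_o = sum(h == j for h, j in zip(human_labels, judge_labels)) / n
    # Chance agreement: sum over labels of P(human = k) * P(judge = k).
    h_counts, j_counts = Counter(human_labels), Counter(judge_labels)
    p_e = sum(h_counts[k] * j_counts[k] for k in h_counts) / (n * n)
    if p_e == 1.0:
        return float("nan")  # both raters used a single identical label; kappa is undefined
    return (p_o - p_e) / (1 - p_e)

# A judge that says PASS (1) on everything, scored on a 90%-PASS sample:
human = [1] * 9 + [0]
judge = [1] * 10
print(kappa_by_hand(human, judge))
```

In this toy sample the agreement rate is 0.9, yet kappa is 0, because a judge that always answers PASS would reach 0.9 by chance alone on such imbalanced labels.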
Length-bias diagnosis·python
# Quick experiment: take 30 outputs and ask the judge to rate them.
# Then ask it to rate truncated versions (50% length).
# If the average score drops significantly, your judge may have length bias.
# Truncation also removes content, so treat a drop as a warning sign, not proof.

def length_bias_test(cases, judge):
    """Print and return average judge scores for full vs. half-length outputs."""
    full_scores, short_scores = [], []
    for c in cases:
        out = c["output"]
        full_scores.append(judge(c, out))
        short = out[:len(out)//2] + "..."
        short_scores.append(judge(c, short))
    avg_full = sum(full_scores) / len(full_scores)
    avg_short = sum(short_scores) / len(short_scores)
    print(f"avg full: {avg_full:.2f}")
    print(f"avg short: {avg_short:.2f}")
    # Big gap → judge rewards length more than content.
    return avg_full, avg_short
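Truncation removes content as well as length, so a complementary check is the expanded variant: pad each output with filler that adds no information and see whether scores rise. A minimal sketch, assuming the same `judge(case, output)` signature as the test above and numeric scores; the filler sentence and the name `padding_bias_test` are placeholders.

```python
def padding_bias_test(cases, judge,
                      filler=" To summarize, the points above address the question."):
    """Average score change when content-free filler is appended to each output."""
    if not cases:
        raise ValueError("cases must be non-empty")
    deltas = []
    for c in cases:
        base = judge(c, c["output"])
        # Same answer, more words: an unbiased judge should not score this higher.
        padded = judge(c, c["output"] + filler * 3)
        deltas.append(padded - base)
    return sum(deltas) / len(deltas)
```

A clearly positive average suggests the judge is rewarding length itself, since the padded outputs carry no additional content.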
Two-judge ensemble — agreement gate·python
def ensemble_judge(case, judge_gpt, judge_claude):
    a = judge_gpt(case)
    b = judge_claude(case)
    if a["verdict"] == b["verdict"]:
        return {"verdict": a["verdict"], "agreement": True}
    return {"verdict": "REVIEW", "agreement": False, "a": a, "b": b}

# Cases marked REVIEW go to a human queue. Cheap insurance for high-stakes evals.
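The agreement gate above extends to the three-judge case mentioned earlier. The sketch below is one possible aggregation rule (strict majority, otherwise human review), not the only reasonable one; it assumes each judge is a callable returning a dict with a `"verdict"` key, as in the two-judge example, and `majority_judge` is a hypothetical name.

```python
from collections import Counter

def majority_judge(case, judges):
    """Return the strict-majority verdict, or REVIEW when no verdict has a majority."""
    if not judges:
        raise ValueError("need at least one judge")
    verdicts = [judge(case)["verdict"] for judge in judges]
    top, count = Counter(verdicts).most_common(1)[0]
    if count * 2 > len(verdicts):
        # Keep the raw verdicts so non-unanimous cases can still be sampled for review.
        return {"verdict": top, "unanimous": count == len(verdicts), "verdicts": verdicts}
    return {"verdict": "REVIEW", "unanimous": False, "verdicts": verdicts}
```

Tracking `unanimous` separately keeps the disagreement signal available even when a majority exists, so split decisions can be audited at a lower rate than full REVIEW cases.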

Exercise

Pick the judge you use most. Have a colleague rate 50 outputs. Compute agreement. If kappa is below 0.6, iterate on the prompt until it exceeds 0.7. Document what you changed.
