Judge Bias and Calibration

Judges are biased. Measure it.

An LLM judge is an instrument with known systematic errors. Treat it like a thermometer that reads 2 degrees high: usable, but only once you correct for the offset.

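Four biases that matter

Position bias: in pairwise comparisons, the judge prefers the output shown first (or, occasionally, the one shown last). Defense: run both orderings and average the results.
Length bias: the judge rewards longer outputs even when the shorter one is better. Defense: include "Length should not influence the verdict" in the rubric, and spot-check with truncated and expanded variants of the same answer.
Self-preference: the judge prefers outputs that look like they came from its own model family. Defense: use a judge from a different model family, or use several judges and look at their agreement.
Verbosity / confidence bias: the judge prefers assertive, polished outputs over hedged or uncertain ones, even when uncertainty is the appropriate response. Defense: add an explicit rubric clause about appropriate hedging.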
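
The position-bias defense above (run both orderings and average) can be sketched as follows. This is a minimal illustration, not a reference implementation: the `judge(prompt, first, second)` callable and its `"first"` / `"second"` / `"tie"` return values are assumptions made for the example, so adapt them to whatever your judge actually returns.

```python
from typing import Callable

# Hypothetical judge signature (an assumption for this sketch):
# judge(prompt, first, second) -> "first" | "second" | "tie"
PairwiseJudge = Callable[[str, str, str], str]

def debiased_pairwise(prompt: str, out_a: str, out_b: str, judge: PairwiseJudge) -> dict:
    """Run the judge in both orderings and average the two verdicts."""
    v1 = judge(prompt, out_a, out_b)  # A shown first
    v2 = judge(prompt, out_b, out_a)  # B shown first

    # Map each positional verdict back to the underlying output.
    w1 = {"first": "A", "second": "B", "tie": "tie"}[v1]
    w2 = {"first": "B", "second": "A", "tie": "tie"}[v2]

    # Score from A's point of view: win = 1, tie = 0.5, loss = 0.
    score = {"A": 1.0, "B": 0.0, "tie": 0.5}
    mean_a = (score[w1] + score[w2]) / 2

    if mean_a > 0.5:
        winner = "A"
    elif mean_a < 0.5:
        winner = "B"
    else:
        winner = "tie"

    # consistent == False means the verdict flipped with the ordering,
    # which is itself a direct measurement of position bias.
    return {"winner": winner, "score_a": mean_a, "consistent": w1 == w2}
```

Tracking the fraction of cases where `consistent` is false gives a rough position-bias rate for a given judge prompt. A judge that always picks the first output ends up with a tie on every case under this scheme.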

How to calibrate a judge

Sample 100 cases.
Have humans rate them.
Run the judge on the same 100 cases.
Compute Cohen's kappa or a simple agreement rate.
Look at the disagreements. Update the prompt. Run it again.
Iterate until judge-human agreement reaches the level of inter-human agreement.
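
The "compute kappa, then look at the disagreements" steps above can be sketched without any dependencies. Cohen's kappa is (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e is the agreement expected by chance from each rater's label frequencies. The helper names `cohen_kappa` and `disagreements` and the case IDs are illustrative assumptions, not part of any library.

```python
from collections import Counter

def cohen_kappa(human, judge):
    """Cohen's kappa for two equal-length lists of categorical labels."""
    if not human or len(human) != len(judge):
        raise ValueError("need two non-empty label lists of equal length")
    n = len(human)
    # Observed agreement: fraction of cases with identical labels.
    p_o = sum(h == j for h, j in zip(human, judge)) / n
    # Chance agreement: sum over labels of the product of marginal rates.
    h_counts, j_counts = Counter(human), Counter(judge)
    p_e = sum(h_counts[k] * j_counts[k] for k in h_counts) / (n * n)
    if p_e == 1.0:
        # Kappa is undefined when both raters used one identical label
        # throughout; returning 1.0 is a choice made for this sketch.
        return 1.0
    return (p_o - p_e) / (1 - p_e)

def disagreements(case_ids, human, judge):
    """Return (case_id, human_label, judge_label) for every mismatch."""
    return [(cid, h, j) for cid, h, j in zip(case_ids, human, judge) if h != j]
```

Reading the mismatched cases one by one is usually where the prompt fixes come from; the kappa value only tells you whether another iteration is needed.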

Principle: a judge that has not been calibrated against humans on your specific task is a guess. Calibration is the difference between an instrument and a story.

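Multiple judges → ensemble

For high-stakes evals, run two or three judges from different model families and aggregate their verdicts. Disagreement between judges is a useful signal: those are the cases that need human review. This costs more than a single-judge pipeline, but it is typically much more reliable.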
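
For three judges, one possible aggregation is a majority vote that still flags any non-unanimous case for human review. This is a sketch under assumptions: the function name `aggregate_verdicts` and the `"PASS"` / `"FAIL"` / `"REVIEW"` labels are chosen for illustration, and whether to keep the majority verdict or withhold it on disagreement is a policy decision for your eval.

```python
from collections import Counter

def aggregate_verdicts(verdicts):
    """Aggregate one label per judge, e.g. ["PASS", "PASS", "FAIL"]."""
    if not verdicts:
        raise ValueError("need at least one verdict")
    counts = Counter(verdicts)
    top_label, top_count = counts.most_common(1)[0]
    unanimous = top_count == len(verdicts)
    has_majority = top_count > len(verdicts) / 2
    return {
        # Without a strict majority there is no verdict to report.
        "verdict": top_label if has_majority else "REVIEW",
        "unanimous": unanimous,
        # Any split vote is routed to a human, even if a majority exists.
        "needs_human": not unanimous,
        "votes": dict(counts),
    }
```

With two judges, any disagreement has no strict majority, so the result is `"REVIEW"`, the same outcome as a two-judge agreement gate.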

Code

Calibration: measuring agreement with humans·python

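from sklearn.metrics import cohen_kappa_score

def calibrate(human_labels, judge_labels):
    """Compare judge labels to human labels.

    Both are equal-length lists of 0/1 (or PASS/FAIL).
    """
    if not human_labels or len(human_labels) != len(judge_labels):
        raise ValueError("label lists must be non-empty and the same length")
    # Raw agreement: fraction of cases where judge and human match.
    agree = sum(h == j for h, j in zip(human_labels, judge_labels)) / len(human_labels)
    # Cohen's kappa corrects raw agreement for agreement expected by chance.
    kappa = cohen_kappa_score(human_labels, judge_labels)
    return {"agreement": agree, "kappa": kappa}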

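# Target: kappa > 0.7. Kappa near 0 means agreement no better than
# chance; values below about 0.4 are conventionally read as weak
# agreement, so treat such a judge as unreliable for this task.
# Inter-human agreement is the practical ceiling: judges rarely beat
# humans at agreeing with humans.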

Length-bias diagnostic·python

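# Quick experiment: take 30 outputs and ask the judge to rate them.
# Then ask it to rate truncated versions (50% length).
# If the average score drops sharply, the judge may have length bias.
# Caveat: truncation also removes content, so some drop is legitimate.
# An expanded (padded) variant of the same answer is a cleaner probe.

def length_bias_test(cases, judge):
    """judge(case, output) is expected to return a numeric score."""
    if not cases:
        raise ValueError("cases must be non-empty")
    full_scores, short_scores = [], []
    for c in cases:
        out = c["output"]
        full_scores.append(judge(c, out))
        short = out[:len(out) // 2] + "..."  # keep only the first half
        short_scores.append(judge(c, short))
    avg_full = sum(full_scores) / len(full_scores)
    avg_short = sum(short_scores) / len(short_scores)
    print(f"avg full: {avg_full:.2f}")
    print(f"avg short: {avg_short:.2f}")
    # A gap larger than the lost content explains suggests the judge
    # rewards length more than content.
    return avg_full, avg_short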

Two-judge ensemble with an agreement gate·python

def ensemble_judge(case, judge_gpt, judge_claude):
    a = judge_gpt(case)
    b = judge_claude(case)
    if a["verdict"] == b["verdict"]:
        return {"verdict": a["verdict"], "agreement": True}
    return {"verdict": "REVIEW", "agreement": False, "a": a, "b": b}

# Cases marked REVIEW go to a human queue. Cheap insurance for high-stakes evals.
