Pointwise vs Pairwise Evaluation

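Two ways to use an LLM as a judge

Pointwise: the judge is shown one output and gives it a score from 1 to 5 (or pass/fail).

Pairwise: the judge is shown two outputs side by side and picks the winner (or declares a tie).

Both are useful, in different situations. Many teams default to pointwise because it looks simpler. In practice, pairwise is often more reliable.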

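Why pairwise is easier for an LLM

Absolute scoring requires calibration ("what does a 4 versus a 5 mean?"). LLMs are notoriously bad at this: they bunch scores near the middle of the scale and rarely use the extremes. Pairwise sidesteps the problem: "Is A better than B?" is an easier question than "Rate A on a 1-5 scale."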
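One way to see whether a pointwise judge is bunching its scores is to look at the histogram. The following is a minimal sketch, not part of the original lesson: `score_spread` is a hypothetical helper, and the sample scores are made up for illustration.

```python
from collections import Counter

def score_spread(scores, scale=(1, 2, 3, 4, 5)):
    """Summarize how concentrated a list of pointwise scores is."""
    counts = Counter(scores)
    n = len(scores)
    histogram = {s: counts.get(s, 0) for s in scale}
    # Share of all scores that landed on the single most common value.
    mode_share = max(histogram.values()) / n
    # Share of scores that used either end of the scale.
    extremes = (histogram[scale[0]] + histogram[scale[-1]]) / n
    return histogram, mode_share, extremes

# Illustrative (made-up) scores from a judge that avoids the extremes.
scores = [3, 4, 3, 3, 4, 3, 2, 3, 4, 3]
histogram, mode_share, extremes = score_spread(scores)
print(histogram, mode_share, extremes)
```

A high `mode_share` together with `extremes` near zero suggests the judge is not really using the scale.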

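Position bias and how to defeat it

Show a judge two outputs and it picks the one that appears first (or last) more often than chance. The defense: run every pair twice, once with A first and once with B first, and combine the two verdicts. If they agree, that is the result. If the orderings disagree, the judge is too biased on this comparison to trust; record it as a TIE.

Principle: always run pairwise comparisons in both orderings. Position bias is real, and if ignored it is large enough to flip a ranking.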
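To gauge how strong the bias is for a given judge, you can count how often the two orderings disagree and how often each slot wins. A minimal sketch, assuming both verdicts for each pair have already been mapped back to the original outputs; `position_bias_report` and the sample records are hypothetical.

```python
def position_bias_report(records):
    """records: list of (verdict_ab, verdict_ba) tuples.

    verdict_ab comes from the run with A shown first, verdict_ba from the
    run with B shown first. Each verdict is "A", "B", or "TIE", expressed
    in terms of the ORIGINAL outputs.
    """
    n = len(records)
    if n == 0:
        raise ValueError("no records")
    # A flip means the two orderings produced different verdicts.
    flips = sum(1 for ab, ba in records if ab != ba)
    # In the A-first run the first slot holds A; in the B-first run it holds B.
    first = sum(ab == "A" for ab, _ in records) + sum(ba == "B" for _, ba in records)
    second = sum(ab == "B" for ab, _ in records) + sum(ba == "A" for _, ba in records)
    return {
        "flip_rate": flips / n,
        "first_slot_win_rate": first / (2 * n),
        "second_slot_win_rate": second / (2 * n),
    }

# Made-up verdicts for four pairs.
records = [("A", "A"), ("A", "B"), ("B", "B"), ("A", "B")]
report = position_bias_report(records)
print(report)
```

Because every pair is shown in both orders, an unbiased judge should give the two slots roughly equal win rates. A large gap between them points to position bias.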

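When pairwise is the wrong tool

You need an absolute number rather than a ranking (regulatory reports, SLAs).
You are evaluating one system over time rather than comparing two systems.
You have many candidates (e.g. N=20). Pairwise needs N(N-1)/2 comparisons; pointwise needs N.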
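The cost gap in the last point is easy to work out. A small sketch of the arithmetic per evaluated question (`judge_calls` is a hypothetical helper); it also doubles the pairwise count when each pair is run in both orderings.

```python
from math import comb

def judge_calls(n_candidates, both_orderings=True):
    """Count judge calls needed to cover n_candidates on one question."""
    pairs = comb(n_candidates, 2)  # N(N-1)/2 unordered pairs
    pairwise_calls = pairs * (2 if both_orderings else 1)
    return {"pointwise": n_candidates, "pairs": pairs, "pairwise_calls": pairwise_calls}

result = judge_calls(20)
print(result)
# {'pointwise': 20, 'pairs': 190, 'pairwise_calls': 380}
```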

Use pairwise to ground pointwise

A common pattern: run pairwise on a small set, derive an Elo-style ranking, and use the ranking to pick pointwise calibration anchors ("this is what a 5 looks like; this is what a 3 looks like"). It costs one extra evaluation cycle but produces a much more honest pointwise judge.
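The ranking step could look roughly like the sketch below. It uses the standard Elo update; the K-factor of 32, the starting rating of 1000, the function name `elo_ratings`, and the sample matches are all assumptions for illustration, not something the lesson prescribes.

```python
def elo_ratings(matches, k=32.0, base=1000.0):
    """matches: iterable of (name_a, name_b, result), result in {"A", "B", "TIE"}.

    result is the pairwise verdict after both orderings have been reconciled.
    """
    ratings = {}
    for a, b, result in matches:
        ra = ratings.setdefault(a, base)
        rb = ratings.setdefault(b, base)
        # Expected score for a under the standard Elo logistic curve.
        expected_a = 1.0 / (1.0 + 10 ** ((rb - ra) / 400.0))
        score_a = {"A": 1.0, "B": 0.0, "TIE": 0.5}[result]
        delta = k * (score_a - expected_a)
        ratings[a] = ra + delta
        ratings[b] = rb - delta
    return ratings

# Made-up pairwise verdicts over three responses.
matches = [
    ("resp_1", "resp_2", "A"),
    ("resp_2", "resp_3", "A"),
    ("resp_1", "resp_3", "A"),
    ("resp_1", "resp_2", "TIE"),
]
ratings = elo_ratings(matches)
ranking = sorted(ratings, key=ratings.get, reverse=True)
print(ranking)
```

Responses near the top, middle, and bottom of `ranking` are candidates for the "this is a 5", "this is a 3", and "this is a 1" anchors. Elo updates depend on match order, so with a small set it may be worth shuffling the matches and averaging over several passes.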

Code

Pairwise judge with position-bias defense·python

import json

def parse_judge_output(raw):
    # Assumed minimal helper (not defined in the original): expects the judge
    # to reply with a bare JSON object.
    return json.loads(raw)

def pairwise_judge(question, output_a, output_b, judge_model):
    # judge_model is assumed to expose complete(prompt, temperature=...) -> str.
    def ask(first_label, first, second_label, second):
        prompt = f"""
Question: {question}

Response {first_label}: {first}

Response {second_label}: {second}

Which response is better? Reply with JSON: {{"reasoning": "...", "winner": "{first_label}" or "{second_label}" or "TIE"}}"""
        return parse_judge_output(judge_model.complete(prompt, temperature=0))

    v1 = ask("A", output_a, "B", output_b)
    v2 = ask("A", output_b, "B", output_a)  # same labels, outputs swapped
    # Map v2 back: if v2 says A wins, that means output_b won on the second call.
    win1 = v1["winner"]
    win2 = {"A": "B", "B": "A", "TIE": "TIE"}[v2["winner"]]
    reasons = [v1["reasoning"], v2["reasoning"]]
    if win1 == win2:
        return win1, reasons
    # The two orderings disagree: treat the comparison as a tie.
    return "TIE", reasons

Pointwise scoring with explicit rubric anchors·python

POINTWISE_PROMPT = """
Score this response from 1 to 5.

5 = perfect: factually correct, complete, well-structured, no issues
4 = good: correct and complete, minor stylistic issues
3 = acceptable: mostly correct, but missing a notable detail or has a small error
2 = poor: major missing detail or factual error
1 = bad: largely incorrect, off-topic, or unhelpful

Question: {question}
Response: {response}

Reply with JSON: {{"reasoning": "...", "score": 1|2|3|4|5}}
"""

# Without anchors, judges cluster around 4. With anchors, distribution spreads.
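A prompt template like this still needs a thin wrapper to fill it in and validate the reply. The sketch below is one possible shape, under stated assumptions: `pointwise_score`, `StubJudge`, and the shortened `TEMPLATE` are hypothetical stand-ins, and the judge client is assumed to expose `complete(prompt, temperature=...)` returning a JSON string.

```python
import json

def pointwise_score(question, response, judge_model, prompt_template):
    """Fill the template, call the judge, and validate the 1-5 score."""
    prompt = prompt_template.format(question=question, response=response)
    parsed = json.loads(judge_model.complete(prompt, temperature=0))
    score = int(parsed["score"])
    if not 1 <= score <= 5:
        raise ValueError(f"score out of range: {score}")
    return score, parsed["reasoning"]

class StubJudge:
    """Stand-in for a real judge client; always returns the same verdict."""
    def complete(self, prompt, temperature=0):
        return '{"reasoning": "correct but terse", "score": 4}'

# Shortened template for the demo; a real one would include the rubric anchors.
TEMPLATE = (
    "Question: {question}\n"
    "Response: {response}\n"
    'Reply with JSON: {{"reasoning": "...", "score": 1|2|3|4|5}}'
)

score, reasoning = pointwise_score("What is 2 + 2?", "4", StubJudge(), TEMPLATE)
print(score, reasoning)
```

Rejecting out-of-range scores early keeps a malformed judge reply from silently skewing the averages.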
