Judge Prompt Design

Judge prompt 가 코드 안의 rubric

Judge 는 prompt 만큼만 좋아. 모호한 prompt 가 모호한 judge 만들어. 좋은 judge prompt 작성의 규율은 좋은 annotation rubric 작성과 동일해 — 같은 5개 요소, LLM 용으로 rendering 되는 것뿐.

강한 judge prompt 의 5개 요소

Role 과 stance — "You are an expert evaluator focusing on factual correctness." "Be a judge" 가 아니라.
정확한 criteria — pass 와 fail 이 뭐야, 평이한 말로 모호함 없이.
Worked example — 적어도 pass 와 fail 하나씩, verdict 와 reasoning 와 함께. Few-shot 이 calibration 극적으로 tighten.
Edge case 처리 — refusal, partial answer, off-topic correct answer 어떻게 다룰지.
엄격한 output 형식 — named field 의 JSON, parsing robust 하게.

Rationale-before-verdict trick

Judge 한테 verdict 만들기 전에 step-by-step 으로 생각하라고 해. Model 이 결정 전 verbalize 할 때 더 잘 reason. 구체적: "reasoning" 이 "verdict" 앞에 오게 JSON 구조화. 이 단일 변경이 judge 정확도 보통 3-8 포인트 올려.

원칙: 옳은 순서로 생각 소리내는 judge 가 더 잘 judge. Output schema 에 항상 rationale 을 verdict 앞에.

어려운 task 엔 few-shot 필수

미묘한 criteria (faithfulness, helpfulness, tone) 엔 zero-shot judge 가 drift. 3-5 worked example — disagreement case 포함 ("이 output 은 PASS 처럼 보이지만 실제론 FAIL, 왜냐면...") — 이 judge 를 rubric 에 lock.

Judge 는 temperature 0

Judge 는 같은 input 에 deterministic 해야. temperature=0 설정하고 가능하면 seed 고정. 안 그러면 화요일과 수요일에 같은 eval run 이 다른 숫자 줘 — 변경과 noise 구분 못 해.

Code

강한 judge prompt — 전체 skeleton·python

JUDGE_PROMPT = """
You are an expert evaluator measuring whether assistant responses are faithful to the provided context.

## Criteria
- PASS: every factual claim in the response is supported by the context.
- FAIL: the response contains at least one factual claim not supported by the context.
- Style, tone, length, and grammar do NOT influence the verdict.
- A refusal to answer is FAIL only if the context contains the answer.

## Examples

Example 1
Context: \"The Eiffel Tower is 330m tall and located in Paris.\"
Question: \"How tall is the Eiffel Tower?\"
Response: \"The Eiffel Tower is 330m tall.\"
Verdict: PASS
Reasoning: The 330m claim is directly supported by the context.

Example 2
Context: \"The Eiffel Tower is 330m tall and located in Paris.\"
Question: \"How tall is the Eiffel Tower?\"
Response: \"The Eiffel Tower is 324m tall, including its antenna.\"
Verdict: FAIL
Reasoning: 324m and the antenna detail are not in the context.

## Now evaluate this case
Context: {context}
Question: {question}
Response: {response}

Reply ONLY with JSON in exactly this shape (reasoning FIRST):
{{\"reasoning\": \"<one or two sentences>\", \"verdict\": \"PASS\" or \"FAIL\"}}
"""

Robust JSON parsing — judge 가 가끔 drift·python

import json, re

def parse_judge_output(raw):
    raw = raw.strip()
    # Strip ```json fences if the judge added them
    fenced = re.search(r'```(?:json)?\s*(\{.*\})\s*```', raw, flags=re.S)
    if fenced:
        raw = fenced.group(1)
    try:
        d = json.loads(raw)
        if d.get("verdict") not in ("PASS", "FAIL"):
            raise ValueError("unexpected verdict")
        return d
    except Exception as e:
        return {"verdict": "FAIL", "reasoning": f"unparseable judge output: {e}"}

Judge prompt 가 코드 안의 rubric

강한 judge prompt 의 5개 요소

Rationale-before-verdict trick

어려운 task 엔 few-shot 필수

Judge 는 temperature 0

Code

External links

Exercise

Progress

댓글 0