Concept: AI evaluating AI

Why use a model to evaluate a model

For subtle quality questions ('Is this summary faithful?', 'Is this answer helpful?', 'Does this answer match the brand tone?') there is no regex you can write. You need a grader that understands meaning, context, and pragmatics. That grader is another LLM, prompted to act as a judge.

The pattern is straightforward:

The system under test generates an output.
A judge prompt presents the input, the output, and the criteria.
The judge model returns a score (or pass/fail) plus a written rationale.
The score goes into the same dashboard as your deterministic metrics.
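
The four steps above can be sketched as a single function that produces one results row per case. This is a minimal illustration, not a prescribed design: `evaluate_case`, `JudgeResult`, the row field names, and the two stand-in callables are all hypothetical, and the "dashboard" is represented only by the returned dict.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class JudgeResult:
    passed: bool
    rationale: str


def evaluate_case(
    case: dict,
    system: Callable[[str], str],
    judge: Callable[[str, str, str], JudgeResult],
    criteria: str,
) -> dict:
    """Run one case through the loop and return one results row."""
    output = system(case["input"])                    # system under test generates an output
    result = judge(case["input"], output, criteria)   # judge sees input, output, criteria
    return {
        "case_id": case["id"],
        # Deterministic metric and judge verdict live in the same row,
        # so both can feed the same dashboard.
        "exact_match": output.strip() == case.get("reference", "").strip(),
        "judge_pass": result.passed,
        "judge_rationale": result.rationale,
    }


# Stand-ins so the sketch runs without any API (assumptions, not real models).
def fake_system(text: str) -> str:
    return text.upper()


def fake_judge(user_input: str, output: str, criteria: str) -> JudgeResult:
    ok = len(output) > 0
    return JudgeResult(ok, "non-empty answer" if ok else "empty answer")


row = evaluate_case(
    {"id": "c1", "input": "hello", "reference": "HELLO"},
    fake_system,
    fake_judge,
    "Is the answer helpful?",
)
print(row)
```

In a real setup the judge callable would wrap an LLM call; the point here is only that the judge verdict is stored next to the cheap metrics rather than in a separate report.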

Why it (mostly) works

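Strong judge models (GPT-5, Claude Opus 4.x, Gemini Pro) align surprisingly well with human ratings on many evaluation tasks. Several academic studies report that LLM judges agree with humans about as often as humans agree with each other (typically 70-85%). That is enough for most product evals, especially when you cannot afford continuous human annotation.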

Principle: an LLM judge is a sampling-aware, prompt-engineered measurement instrument. Treat it like an instrument: calibrate it, validate it, and re-check its readings.
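
One way to calibrate the instrument is to compare judge verdicts against a small set of human labels. The sketch below computes raw agreement and Cohen's kappa (agreement corrected for chance). The function name and the example labels are made up for illustration; they are not results from any study.

```python
import math
from collections import Counter


def agreement_and_kappa(human: list, judge: list) -> tuple:
    """Return (raw agreement, Cohen's kappa) for two equal-length label lists."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(human)
    # Observed agreement: fraction of items where both raters gave the same label.
    observed = sum(h == j for h, j in zip(human, judge)) / n
    # Expected agreement if the raters labeled independently at their own base rates.
    h_counts, j_counts = Counter(human), Counter(judge)
    labels = set(h_counts) | set(j_counts)
    expected = sum(h_counts[label] * j_counts[label] for label in labels) / (n * n)
    if expected == 1.0:
        # Both raters used a single identical label: kappa is undefined.
        return observed, math.nan
    return observed, (observed - expected) / (1 - expected)


# Hypothetical labels for eight eval cases.
human_labels = ["pass", "pass", "fail", "pass", "fail", "fail", "pass", "pass"]
judge_labels = ["pass", "fail", "fail", "pass", "fail", "pass", "pass", "pass"]

observed, kappa = agreement_and_kappa(human_labels, judge_labels)
print(f"agreement={observed:.2f} kappa={kappa:.2f}")
```

Raw agreement alone can look good when almost every case passes, which is why the chance-corrected number is worth tracking alongside it. Re-running this check after any change to the judge prompt or judge model is one way to "re-check the readings".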

Where it fails

Position bias: in pairwise comparisons, the judge picks the output shown first (or last) more often than chance.
Self-preference: a judge built on a given base model favors outputs from that same base model.
Length bias: judges often reward longer outputs, even when the shorter one is better.
Cost: every judged sample is an extra API call. At scale, that matters.
Brittle prompts: a small change to the rubric can flip the verdict.
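
A common mitigation for position bias is to ask the judge twice with the order swapped and accept a winner only when both runs agree. The sketch below assumes a hypothetical `judge_pair(question, first, second)` callable that returns the string "first" or "second"; the two stub judges exist only to make the example runnable. It addresses position bias only, and it doubles the judge cost per comparison.

```python
from typing import Callable


def debiased_pairwise(
    question: str,
    answer_a: str,
    answer_b: str,
    judge_pair: Callable[[str, str, str], str],
) -> str:
    """Return "A", "B", or "tie" after judging both presentation orders."""
    forward = judge_pair(question, answer_a, answer_b)    # A shown first
    backward = judge_pair(question, answer_b, answer_a)   # B shown first
    for verdict in (forward, backward):
        if verdict not in ("first", "second"):
            raise ValueError(f"unexpected judge verdict: {verdict!r}")
    winner_forward = "A" if forward == "first" else "B"
    winner_backward = "B" if backward == "first" else "A"
    # If the winner changes when the order changes, the verdict followed
    # position rather than content, so report a tie.
    return winner_forward if winner_forward == winner_backward else "tie"


# Stub judge that always prefers whatever is shown first (pure position bias).
def always_first(question: str, first: str, second: str) -> str:
    return "first"


# Stub judge whose choice depends on content only (here, simply the longer text).
def longer_wins(question: str, first: str, second: str) -> str:
    return "first" if len(first) >= len(second) else "second"


print(debiased_pairwise("q", "short", "a much longer answer", always_first))
print(debiased_pairwise("q", "short", "a much longer answer", longer_wins))
```

The position-biased stub collapses to a tie, while the content-driven stub yields the same winner in both orders. Tracking the tie rate over a dataset gives a rough read on how position-sensitive a given judge prompt is.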

Use a stronger judge for a weaker system

Best practice: the judge model should be at least as capable as the model being evaluated, and ideally stronger. Judging Claude Opus output with Haiku is asking for trouble: Haiku misses the subtleties that Opus introduces.

Code

MVP LLM judge·python

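import json

# Doubled braces are literal { } once str.format has run.
JUDGE_PROMPT = """You are an expert evaluator.

User asked: {input}
Assistant answered: {output}

Is the answer correct, complete, and helpful for what the user asked?
Reply ONLY in JSON:
{{"verdict": "pass" or "fail", "reason": "one short sentence"}}"""


def llm_judge(case, output, judge_model):
    """Return (passed, reason) for one eval case.

    judge_model is any client object exposing complete(prompt, temperature=...)
    that returns the reply text.
    """
    prompt = JUDGE_PROMPT.format(input=case["input"], output=output)
    # temperature=0 keeps verdicts as repeatable as the model allows.
    raw = judge_model.complete(prompt, temperature=0)
    try:
        parsed = json.loads(raw)
        return parsed["verdict"] == "pass", parsed["reason"]
    except (json.JSONDecodeError, KeyError, TypeError):
        # A malformed judge reply is recorded as a fail instead of crashing the run.
        return False, f"unparseable judge reply: {str(raw)[:80]!r}"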

When to call the judge: decision tree·python

def pick_grader(eval_case):
    """Cheaper graders first; judge as last resort."""
    if eval_case.has_exact_reference and eval_case.is_short:
        return "exact_match"
    if eval_case.has_reference_phrase:
        return "contains"
    if eval_case.is_format_check:
        return "schema_validate"
    if eval_case.is_classification:
        return "f1_score"
    if eval_case.has_reference_text and eval_case.compare_meaning:
        return "bertscore"
    return "llm_judge"  # everything else
