LLM-as-judge: use cases and pitfalls

~16 min · evaluation, judges


Cheap, fast, biased

Using an LLM to grade another LLM's output is the cheapest way to scale evaluation. It is also easily fooled. Used carefully, it accelerates your work; used carelessly, it produces comforting numbers that do not predict real quality.

Where LLM judges work

Format checking: is the JSON valid? Does the response include every required field?
Pairwise comparison against a clear rubric.
Categorical classification (refund vs. complaint vs. feature request).
Hallucination detection against a reference ("does the response contradict the source?").
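
Categorical judging is only reliable if the judge's reply is forced into a closed label set. A minimal sketch, assuming the judge returns a single label as plain text; `ALLOWED_LABELS` and `parse_label` are hypothetical names used for illustration, and the labels come from the classification example in the list above:

```python
from typing import Optional

# Closed label set, taken from the classification example in the list above.
ALLOWED_LABELS = {"refund", "complaint", "feature request"}


def parse_label(judge_reply: str) -> Optional[str]:
    """Return the normalized label, or None if the reply is not exactly one allowed label."""
    # Trim whitespace, surrounding quotes and a trailing period, then normalize case.
    cleaned = judge_reply.strip().strip(".\"'").lower().replace("_", " ")
    return cleaned if cleaned in ALLOWED_LABELS else None
```

A `None` result is best treated as a judge failure to retry or send to a human, rather than silently mapped to the nearest label.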

Where LLM judges fail

Numerical scoring of subjective quality (1–5 helpfulness scores are noisy).
Cases where the judge shares the same blind spots as the model under test.
Adversarial situations: the judge model can be prompt-injected too.

Calibrating the judge

Have a human spot-check 20% of the judge's decisions. If the agreement rate is below 85%, the judge is not usable for that task; tighten the rubric or do more human evaluation.
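
The spot-check can be sketched in a few lines. This assumes judge and human verdicts are stored as dictionaries keyed by item ID; the function names, the fixed seed, and the data layout are illustrative assumptions, while the 20% sample and the 85% threshold come from the text:

```python
import random
from typing import Dict, List


def sample_for_review(item_ids: List[str], fraction: float = 0.20, seed: int = 0) -> List[str]:
    """Pick a random subset of judged items for a human to re-grade."""
    if not item_ids:
        return []
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    k = min(len(item_ids), max(1, round(len(item_ids) * fraction)))
    return rng.sample(item_ids, k)


def agreement_rate(judge: Dict[str, str], human: Dict[str, str]) -> float:
    """Fraction of human-graded items where the judge gave the same verdict."""
    shared = [item_id for item_id in human if item_id in judge]
    if not shared:
        raise ValueError("no items were graded by both judge and human")
    matches = sum(judge[item_id] == human[item_id] for item_id in shared)
    return matches / len(shared)


def judge_is_usable(rate: float, threshold: float = 0.85) -> bool:
    """Apply the 85% agreement threshold from the text."""
    return rate >= threshold
```

Raw agreement is the simplest measure; on heavily imbalanced labels it can look high even when the judge is uninformative, so it may be worth inspecting the disagreements per class as well.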

Code

Hallucination judge (markdown)

## Task
Given the source documents and the response, decide whether the response contains any claim not supported by the source.

## Output
{
  "hallucinated": true | false,
  "unsupported_claims": ["<verbatim claim text>", ...],
  "verdict_confidence": "high" | "medium" | "low"
}

## Source
{{source}}

## Response
{{response}}
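
One way to use this template is sketched below: fill the `{{source}}` and `{{response}}` placeholders, then validate the judge's reply against the output schema before trusting it. The helper names are hypothetical, and the call to the judge model itself is deliberately left out because it depends on the provider:

```python
import json
import re
from typing import Any, Dict

CONFIDENCE_LEVELS = {"high", "medium", "low"}


def render_prompt(template: str, source: str, response: str) -> str:
    """Fill both placeholders in a single pass over the template."""
    values = {"source": source, "response": response}
    # Single pass: placeholder-like text inside the source or response is not expanded again.
    return re.sub(r"\{\{(source|response)\}\}", lambda m: values[m.group(1)], template)


def parse_verdict(raw: str) -> Dict[str, Any]:
    """Parse the judge's reply and check it against the output schema in the prompt."""
    data = json.loads(raw)  # raises json.JSONDecodeError (a ValueError) on invalid JSON
    if not isinstance(data, dict):
        raise ValueError("verdict must be a JSON object")
    if not isinstance(data.get("hallucinated"), bool):
        raise ValueError("'hallucinated' must be true or false")
    claims = data.get("unsupported_claims")
    if not isinstance(claims, list) or not all(isinstance(c, str) for c in claims):
        raise ValueError("'unsupported_claims' must be a list of strings")
    if data.get("verdict_confidence") not in CONFIDENCE_LEVELS:
        raise ValueError("'verdict_confidence' must be high, medium, or low")
    return data
```

A reply that fails validation is a judge error, not a verdict; counting such failures separately keeps them from being mistaken for "no hallucination".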

Exercise

Build an LLM judge for a specific task in your system. Calibrate it against 30 human-graded outputs. If agreement is below 85%, tighten the rubric or switch to categorical scoring.
