Types of evals

Six axes that classify every eval

Before picking a framework or a metric, place the eval on six axes. The combination tells you what to build.

Online vs offline. Online evals run on live traffic; offline evals run on a frozen dataset.
Reference-based vs reference-free. Reference-based evals have a known correct answer (translation, classification, retrieval). Reference-free evals do not (summarization, generation, helpfulness).
Pointwise vs pairwise. Pointwise scores a single output in isolation. Pairwise compares two outputs and picks a winner.
Deterministic vs model-graded. Deterministic grading uses regex, exact-match, or numerical checks. Model-graded means LLM-as-judge.
Black-box vs white-box. Black-box looks only at the final output; white-box also looks at the intermediate steps (retrieved docs, tool calls, chain-of-thought).
Functional vs safety. Functional evals measure task quality; safety evals cover toxicity, bias, jailbreaks, PII leaks, and prompt injection.
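
To make the deterministic side of the fourth axis concrete, the three grader types named above (regex, exact-match, numerical) could be sketched roughly as follows. The function names, the normalization choices, and the tolerance default are assumptions made for this illustration and do not come from any particular framework.

```python
import math
import re


def exact_match(output: str, reference: str) -> bool:
    """Pass only if the two strings are identical after trimming and lowercasing."""
    return output.strip().lower() == reference.strip().lower()


def regex_match(output: str, pattern: str) -> bool:
    """Pass if the output contains at least one match for the pattern."""
    return re.search(pattern, output) is not None


def numeric_match(output: str, reference: float, rel_tol: float = 1e-6) -> bool:
    """Pass if the output parses as a number close to the reference."""
    try:
        value = float(output.strip())
    except ValueError:
        return False  # not a number at all, so it cannot match
    return math.isclose(value, reference, rel_tol=rel_tol)
```

None of these needs a judge model, which is why they are cheap to run on every change. The trade-off is that they only apply when the expected form of the answer can be written down in advance.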

Pick the combination that fits the question

'Did the prompt change make the summaries better?' → offline, reference-free, pairwise, model-graded, black-box, functional.

'Is the live system being jailbroken?' → online, reference-free, pointwise, model-graded, black-box, safety.

'Is the retriever finding the right docs?' → offline, reference-based, pointwise, deterministic, white-box, functional.
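
The retriever question is the easiest of the three to make concrete, because it is reference-based and deterministic. A minimal sketch, assuming each case records the IDs the retriever returned and a labeled set of relevant IDs; the `recall_at_k` name, the data layout, and the sample cases are all illustrative assumptions.

```python
def recall_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Fraction of the known-relevant documents that appear in the top k results."""
    if not relevant_ids:
        raise ValueError("a reference-based eval needs at least one relevant id")
    top_k = set(retrieved_ids[:k])
    return len(top_k & relevant_ids) / len(relevant_ids)


# Hypothetical frozen dataset: what the retriever returned vs. the labeled answer.
cases = [
    {"retrieved": ["d3", "d7", "d1"], "relevant": {"d1", "d9"}},
    {"retrieved": ["d2", "d4", "d5"], "relevant": {"d2"}},
]
scores = [recall_at_k(c["retrieved"], c["relevant"], k=3) for c in cases]
mean_recall = sum(scores) / len(scores)
```

Each case is scored on its own (pointwise), against labels (reference-based), with set arithmetic only (deterministic), and on an intermediate step, not the final answer (white-box).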

Principle: before picking a tool, place the eval on the six axes. Half of the 'this framework can't do what we want' complaints are really 'we are asking it to run two different evals at once.'
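
One way to catch the 'two evals at once' situation early is to group a suite's cases by their full axis combination and count how many distinct combinations come out. The sketch below assumes each case carries an `axes` dict with one key per axis; `axis_combinations`, the key names, and the sample suite are hypothetical.

```python
from collections import defaultdict

# Assumed key names, one per axis.
AXIS_KEYS = ("phase", "reference", "comparison", "grader", "surface", "category")


def axis_combinations(cases: list[dict]) -> dict[tuple, list[str]]:
    """Group case ids by their six-axis combination."""
    groups: dict[tuple, list[str]] = defaultdict(list)
    for case in cases:
        combo = tuple(case["axes"][key] for key in AXIS_KEYS)
        groups[combo].append(case["id"])
    return dict(groups)


base = {"phase": "offline", "reference": "free", "comparison": "pairwise",
        "grader": "llm-judge", "surface": "blackbox", "category": "functional"}
suite = [
    {"id": "summary.001", "axes": base},
    {"id": "summary.002", "axes": base},
    {"id": "jailbreak.001",
     "axes": {**base, "phase": "online", "comparison": "pointwise", "category": "safety"}},
]

groups = axis_combinations(suite)
if len(groups) > 1:
    print(f"{len(groups)} distinct evals in one suite:")
    for combo, ids in groups.items():
        print(" ", combo, ids)
```

More than one group is not necessarily a mistake, but it is a hint that the suite may need to be split before any single framework or metric is judged against it.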

White-box is for debugging; black-box guards against coupling

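White-box evals (inspecting retrieved chunks, tool-call sequences, draft thoughts) are essential for diagnosis, but they couple the eval to the implementation: redesign the system and the eval breaks. Black-box evals (on the final output) stay stable when the implementation changes, but they do not tell you why something failed. Serious teams maintain both: a small white-box probe set for diagnosis and a large black-box check set for regression.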
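
The trade-off can be seen in a small sketch. It assumes a hypothetical `Trace` record that holds the final answer plus the intermediate steps; the field names, both check functions, and the sample run are illustrative and not taken from any framework.

```python
from dataclasses import dataclass, field


@dataclass
class Trace:
    """Hypothetical record of one run: final output plus intermediate steps."""
    final_output: str
    retrieved_doc_ids: list[str] = field(default_factory=list)
    tool_calls: list[str] = field(default_factory=list)


def blackbox_check(trace: Trace, must_contain: str) -> bool:
    """Reads only the final output, so it keeps working after a pipeline redesign."""
    return must_contain.lower() in trace.final_output.lower()


def whitebox_probe(trace: Trace, expected_doc_id: str, expected_tool: str) -> dict[str, bool]:
    """Reads intermediate steps, so it can locate a failure,
    but it breaks if retrieval or tool use is restructured."""
    return {
        "retrieved_expected_doc": expected_doc_id in trace.retrieved_doc_ids,
        "called_expected_tool": expected_tool in trace.tool_calls,
    }


trace = Trace(
    final_output="The refund window is 14 days.",
    retrieved_doc_ids=["faq-02", "policy-11"],
    tool_calls=["search_kb"],
)
passed = blackbox_check(trace, "30 days")
diagnosis = whitebox_probe(trace, expected_doc_id="policy-07", expected_tool="search_kb")
```

In this made-up run the black-box check only reports that the answer is wrong, while the probe suggests the expected document was never retrieved even though the search tool was called.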

Code

Navigating a mixed suite with axis tagging·python

# Tag each eval case so you can slice the suite by axis.
EVAL_CASE = {
    "id": "summarizer.long_doc.001",
    "input": {...},
    "reference": None,                  # axis: reference-free
    "axes": {
        "phase": "offline",
        "reference": "free",
        "comparison": "pointwise",
        "grader": "llm-judge",
        "surface": "blackbox",
        "category": "functional",
    },
    "tags": ["summarization", "long-document", "english"],
}
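
With cases tagged this way, slicing the suite reduces to a filter over the `axes` dict. A minimal sketch; `select_cases` is a hypothetical helper, and the sample cases (with only some axes filled in) are made up for illustration.

```python
def select_cases(cases: list[dict], **wanted: str) -> list[dict]:
    """Return the cases whose axes match every keyword given, e.g. grader="llm-judge"."""
    return [
        case for case in cases
        if all(case["axes"].get(axis) == value for axis, value in wanted.items())
    ]


suite = [
    {"id": "summarizer.long_doc.001",
     "axes": {"phase": "offline", "grader": "llm-judge", "category": "functional"}},
    {"id": "retriever.faq.001",
     "axes": {"phase": "offline", "grader": "deterministic", "category": "functional"}},
    {"id": "chat.jailbreak.001",
     "axes": {"phase": "online", "grader": "llm-judge", "category": "safety"}},
]

offline_judged = select_cases(suite, phase="offline", grader="llm-judge")
safety_only = select_cases(suite, category="safety")
```

Running only the deterministic slice on every commit and the model-graded slice less often is one possible use of such a filter.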

Pairwise vs pointwise: same task, different question·python

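# `judge` is assumed to be an LLM-judge client defined elsewhere that exposes
# score(prompt) and complete(prompt); it is not a specific library's API.

# Pointwise: 'is this output good?' (absolute scale)
def pointwise(output):
    return judge.score(
        f"Rate this answer on a 1-5 scale for accuracy and completeness:\n{output}"
    )

# Pairwise: 'which of these two is better?' (relative, less calibration burden)
def pairwise(output_a, output_b):
    verdict = judge.complete(
        f"Output A: {output_a}\n\nOutput B: {output_b}\n\nWhich is better, A or B? Answer A, B, or TIE."
    )
    # Normalize the reply; anything other than the three allowed labels is flagged.
    label = verdict.strip().upper()
    return label if label in {"A", "B", "TIE"} else "INVALID"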
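
Pairwise verdicts only become a number once they are aggregated. One common choice is a win rate over many comparisons. The sketch below assumes verdicts are the strings `A`, `B`, or `TIE`, counts a tie as half a win, and skips anything else; the `win_rate` helper and the sample verdicts are hypothetical.

```python
from collections import Counter


def win_rate(verdicts: list[str], side: str = "A") -> float:
    """Share of valid comparisons won by `side`, counting a TIE as half a win."""
    counts = Counter(v for v in verdicts if v in {"A", "B", "TIE"})
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no valid verdicts to aggregate")
    return (counts[side] + 0.5 * counts["TIE"]) / total


# The last entry is a malformed judge reply and is ignored.
verdicts = ["A", "B", "A", "TIE", "A", "maybe A?"]
rate_a = win_rate(verdicts)
rate_b = win_rate(verdicts, side="B")
```

Counting ties as half a win keeps the two sides' rates summing to one; treating ties differently is an equally defensible choice.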
