What to evaluate

Eight quality dimensions of LLM output

"Is this output good?" breaks down into measurable sub-questions. Pick the dimensions that matter for your use case. Don't try to measure them all at once.

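Correctness: Is the answer factually right? (Top priority for QA, RAG, and code.)
Faithfulness: Is the answer grounded in the provided context? (RAG-specific. An answer can be correct but unfaithful, or faithful but wrong.)
Relevance: Does the answer address what the user actually asked?
Completeness: Does the answer cover everything the user needs, or does it stop partway?
Coherence: Does the answer hold together logically?
Style / tone: Does it match the brand or persona? (Matters more than people admit.)
Format compliance: Does the JSON parse, does the code run, are the citations well-formed?
Safety: Is it free of toxicity, bias, PII leaks, and jailbreaks?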
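
Format compliance is the one dimension you can usually check without a judge model. A minimal sketch, assuming the output is either a JSON string or Python source; the helper names `json_parses` and `python_compiles` are made up for illustration, and the second one checks syntax only, not whether the code actually runs correctly:

```python
import ast
import json


def json_parses(text: str) -> bool:
    """Return True if the text is valid JSON."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


def python_compiles(source: str) -> bool:
    """Return True if the text is syntactically valid Python.

    This only parses the source; it never executes it.
    """
    try:
        ast.parse(source)
        return True
    except SyntaxError:
        return False


print(json_parses('{"status": "ok"}'))        # True
print(json_parses("{'status': 'ok'}"))        # False: single quotes are not JSON
print(python_compiles("def f(:\n    pass"))   # False: syntax error
```

Checks like these are cheap and deterministic, so they are worth running on every output before spending judge calls on the softer dimensions.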

The trap of evaluating everything

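A common anti-pattern: an 8-axis composite score hides the real signal. If correctness drops 15% and coherence rises 5%, the composite barely moves (with equal weights, the average shifts by only about 1.25 points), and you ship a regression. Separate the dimensions. Track each one on its own. Combine them only at the dashboard layer.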
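
To see how little the composite moves, here is a small worked example. The scores are made up: it assumes eight equally weighted dimensions on a 0-100 scale, all starting at 80, and reads the "15%" and "5%" as percentage points.

```python
DIMENSIONS = (
    "correctness", "faithfulness", "relevance", "completeness",
    "coherence", "style", "format_compliance", "safety",
)


def composite(scores: dict[str, float]) -> float:
    """Equal-weight average across all dimensions."""
    return sum(scores.values()) / len(scores)


# Hypothetical baseline: every dimension scores 80 out of 100.
baseline = {dim: 80.0 for dim in DIMENSIONS}

# Hypothetical candidate: correctness falls 15 points, coherence gains 5.
candidate = dict(baseline, correctness=65.0, coherence=85.0)

print(composite(baseline))    # 80.0
print(composite(candidate))   # 78.75

# Per-dimension deltas show what the average hides.
for dim in DIMENSIONS:
    delta = candidate[dim] - baseline[dim]
    if delta != 0:
        print(f"{dim}: {delta:+.1f}")
```

A 1.25-point dip in the composite is easy to dismiss as noise. The per-dimension view shows a 15-point correctness regression that should block the release.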

Rule of thumb: pick the two or three dimensions that matter most for the task and measure them precisely. Add more only when evidence shows those aren't enough.

Different surfaces, different scopes

Translation tool: correctness + completeness + format compliance. Style is the user's call.

Code assistant: correctness + format compliance (does the code run?) + safety (no insecure patterns).

Customer support chatbot: relevance + faithfulness + tone + safety. When answers are grounded in docs, correctness folds into faithfulness.

Tool-using coding agent: tool-call accuracy + final correctness + cost / step count. White-box trace eval matters more here than it does for a one-shot model.
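
One way to keep each surface honest about its scope is to write the mapping down as config and filter scores through it. A rough sketch: the surface keys, the dimension names, and the helper `scoped_scores` are all illustrative, and "cost / step count" is split into two entries here as an assumption.

```python
# Dimensions to track per product surface, taken from the list above.
SURFACE_DIMENSIONS: dict[str, tuple[str, ...]] = {
    "translation_tool": ("correctness", "completeness", "format_compliance"),
    "code_assistant": ("correctness", "format_compliance", "safety"),
    "support_chatbot": ("relevance", "faithfulness", "tone", "safety"),
    "coding_agent": ("tool_call_accuracy", "final_correctness", "cost", "step_count"),
}


def scoped_scores(surface: str, scores: dict[str, float]) -> dict[str, float]:
    """Keep only the scores that are in scope for the given surface."""
    wanted = SURFACE_DIMENSIONS[surface]
    missing = [dim for dim in wanted if dim not in scores]
    if missing:
        raise KeyError(f"missing scores for {surface}: {missing}")
    return {dim: scores[dim] for dim in wanted}


# Hypothetical scores; coherence was measured but is out of scope for this surface.
all_scores = {"relevance": 0.9, "faithfulness": 0.8, "tone": 0.7,
              "safety": 1.0, "coherence": 0.6}
print(scoped_scores("support_chatbot", all_scores))
```

Raising on a missing dimension is a deliberate choice in this sketch: an in-scope dimension that silently goes unmeasured is worse than a loud failure.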

Code

Multi-dimensional eval: tracking each axis independently·python

from dataclasses import dataclass, fields

@dataclass
class MultiDimensionalScore:
    """One score per dimension, each on a 0-1 scale."""
    correctness: float
    relevance: float
    completeness: float
    coherence: float
    style: float
    safety: float

    def passed(self, thresholds: dict[str, float], default: float = 0.7) -> bool:
        """Each dimension must clear its own threshold; no averaging."""
        # Iterate over the dataclass fields so a newly added dimension
        # is checked automatically.
        return all(
            getattr(self, f.name) >= thresholds.get(f.name, default)
            for f in fields(self)
        )

# Per-dimension pass thresholds; anything not listed falls back to `default`.
thresholds = {
    "correctness": 0.85,  # critical
    "safety": 0.99,       # near-zero tolerance
    "relevance": 0.8,
    "completeness": 0.7,
    "coherence": 0.7,
    "style": 0.6,         # nice to have
}
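
A pass/fail bit is more useful when you can also see which dimension failed. The sketch below works on plain dicts so it stands alone; `failing_dimensions` is a hypothetical helper and the scores are made up.

```python
def failing_dimensions(
    scores: dict[str, float],
    thresholds: dict[str, float],
    default: float = 0.7,
) -> list[str]:
    """Return the dimensions whose score is below its own threshold."""
    return [
        dim for dim, value in scores.items()
        if value < thresholds.get(dim, default)
    ]


thresholds = {
    "correctness": 0.85,
    "safety": 0.99,
    "relevance": 0.8,
    "completeness": 0.7,
    "coherence": 0.7,
    "style": 0.6,
}

# Hypothetical run: everything looks fine except safety.
scores = {
    "correctness": 0.9,
    "safety": 0.97,
    "relevance": 0.85,
    "completeness": 0.75,
    "coherence": 0.8,
    "style": 0.65,
}

print(failing_dimensions(scores, thresholds))          # ['safety']
print(f"{sum(scores.values()) / len(scores):.2f}")     # 0.82
```

The average of 0.82 would pass almost any single composite threshold, yet safety sits below its near-zero-tolerance bar. That is why each axis gets its own gate.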

RAG-specific: faithfulness vs correctness·python

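# An answer can be correct (matches ground truth) but unfaithful
# (the LLM made it up; the retrieved docs did not support it).
# In RAG, unfaithfulness is the dangerous failure mode: it means
# your hallucination defenses are not working.
def faithfulness(answer: str, retrieved_docs: list[str], judge) -> str:
    """Ask a judge model whether the context supports every claim in the answer.

    `retrieved_docs` is assumed to be a list of document strings.
    `judge` is any client object exposing `complete(prompt) -> str`.
    Returns the judge's raw text verdict.
    """
    # Join the documents so the prompt shows plain text, not a Python list repr.
    context = "\n\n".join(retrieved_docs)
    prompt = f"""
Given retrieved context:
{context}

Answer:
{answer}

Is every factual claim in the answer supported by the context?
List unsupported claims if any.
"""
    return judge.complete(prompt)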
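
To track faithfulness as a metric, the judge's free-text verdict has to become something you can count. One possible approach, sketched under assumptions: ask the judge for JSON and parse it. The JSON shape, the `Judge` protocol, and `CannedJudge` (a stand-in that returns a fixed reply so the example runs without a model) are all hypothetical; swap in your real LLM client, and expect to handle malformed replies in practice.

```python
import json
from typing import Protocol


class Judge(Protocol):
    def complete(self, prompt: str) -> str: ...


class CannedJudge:
    """Stand-in judge that returns a fixed reply; replace with a real LLM client."""

    def __init__(self, reply: str) -> None:
        self.reply = reply

    def complete(self, prompt: str) -> str:
        return self.reply


def unsupported_claims(answer: str, retrieved_docs: list[str], judge: Judge) -> list[str]:
    """Return the claims in the answer that the judge says the context does not support."""
    context = "\n\n".join(retrieved_docs)
    prompt = (
        f"Given retrieved context:\n{context}\n\n"
        f"Answer:\n{answer}\n\n"
        'Return a JSON object with one key, "unsupported_claims", listing every '
        "factual claim in the answer that the context does not support."
    )
    reply = judge.complete(prompt)
    # Assumes the judge returned valid JSON in the requested shape.
    return list(json.loads(reply)["unsupported_claims"])


judge = CannedJudge('{"unsupported_claims": ["The refund window is 60 days."]}')
claims = unsupported_claims(
    answer="The refund window is 60 days.",
    retrieved_docs=["Refunds are accepted within 30 days of purchase."],
    judge=judge,
)
print(claims)             # ['The refund window is 60 days.']
print(len(claims) == 0)   # False: the answer is not faithful
```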
