Evaluation vs Testing

SW test 의 muscle memory 는 LLM 에서 안 통해

20년의 SW engineering 지혜는 말해: unit test 써, exact equality assert 해, 매 commit 에 돌려, 어떤 regression 에서든 build 깨뜨려. 그 muscle memory 전부가 — 함수가 probabilistic model 을 부르는 순간 — 깨져.

같은 input 이 호출마다 다른 output 을 내 (sampling temperature, top-p, stochastic decoding).
temperature=0 이어도, model provider 가 weight 를 조용히 update 해 — 지난주 test 한 snapshot 이 이번주엔 미세하게 다를 수 있어.
acceptable output 이 단일 string 인 경우는 드물어. "Paris is the capital of France." 와 "The capital of France is Paris." 둘 다 맞아.
quality 는 graded 야, binary 가 아니야. 90% 유용한 답은 60% 답과 실질적으로 달라 — 그런데 둘 다 assert is_string(output) 통과해.

eval 로 옮길 때 바뀌는 것

측면	Testing	Evaluation
Output	Deterministic	Stochastic
Pass 기준	exact equality	graded score 의 threshold
실패 의미	bug	distribution 의 regression
Retry 정책	flaky test	statistical 분산 — N번 돌려서 평균
Coverage	line / branch	behavior / edge case / persona
Run당 비용	free (CPU)	paid (LLM API call × N samples × graders)

원칙: Eval 은 budget, tolerance band, sampling 전략이 있는 test 야. boolean assertion 이 아니라 statistical hypothesis test 처럼 동작해.

겹치는 곳, 안 겹치는 곳

deterministic test 는 여전히 써야 돼. retrieval 함수가 known query 에 대해 옳은 doc ID 를 return 하는지 test. tool-call parser 가 malformed JSON 에서 raise 하는지 test. prompt template 이 변수를 정확히 substitution 하는지 test. 이건 SW bug 들이고, SW test 가 잡아. eval 은 그 위에 layer 로 올라가, 시스템이 올바르게 wired 된 상태에서 model 의 행동을 측정해. 둘을 헷갈리는 거 — unit test 안에서 LLM call 을 돌리는 거 — 는 flaky CI 와 분노한 동료를 만들어.

Code

잘못된 방식 — stochastic system 위의 deterministic test·python

# This will pass on Tuesday, fail on Wednesday, and you will blame CI.
def test_summarizer():
    output = summarize_article("...long text...")
    assert output == "This article is about climate change."  # ❌ brittle

옳은 방식 — threshold + N samples·python

from statistics import mean

def eval_summarizer(article, expected_topics, n_samples=5):
    """Run the summarizer N times, score each, return distribution."""
    scores = []
    for _ in range(n_samples):
        output = summarize_article(article)
        coverage = topic_coverage_grader(output, expected_topics)  # 0..1
        scores.append(coverage)
    return {
        "mean": mean(scores),
        "min": min(scores),  # worst-case is what production users feel
        "n": n_samples,
        "pass": min(scores) >= 0.7,  # threshold, not equality
    }

Pytest hybrid — deterministic plumbing + eval block·python

import pytest

# Deterministic — runs on every commit, fast, free.
def test_prompt_template_substitution():
    rendered = render_prompt("hello {name}", name="world")
    assert rendered == "hello world"

# Eval — runs on a schedule or before release, slower, costs money.
@pytest.mark.eval
def test_summarizer_quality():
    result = eval_summarizer(SAMPLE_ARTICLE, EXPECTED_TOPICS, n_samples=10)
    assert result["pass"], f"summarizer regression: {result}"

Evaluation vs Testing

SW test 의 muscle memory 는 LLM 에서 안 통해

eval 로 옮길 때 바뀌는 것

겹치는 곳, 안 겹치는 곳

Code

External links

Exercise

Progress

댓글 0