Eval discipline: catch regressions before users find them

~14 min · eval, regression, quality

Three eval classes

You need all three: unit evals (a single prompt → an expected pattern), integration evals (the full tool loop → expected behavior), and quality evals (LLM-as-judge, or human spot-checks on a held-out set). Skip any one of them and regressions in that class go undetected.
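To make the difference between the unit and integration classes concrete, here is a minimal sketch of an integration-style check that asserts on a recorded tool-call trace rather than on a single reply. `ToolCall`, `check_trace`, and the tool names are hypothetical illustrations, not part of any library; a real harness would record the trace from the actual tool loop.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolCall:
    """One step of a recorded tool loop (hypothetical structure)."""
    name: str
    arguments: dict


def check_trace(trace: list[ToolCall], required_order: list[str],
                forbidden: frozenset[str] = frozenset()) -> list[str]:
    """Return failure messages; an empty list means the run passed."""
    failures = []
    names = [call.name for call in trace]

    # Forbidden tools must never appear anywhere in the run.
    for name in names:
        if name in forbidden:
            failures.append(f"forbidden tool called: {name}")

    # Required tools must appear in the given relative order;
    # other calls may be interleaved between them.
    position = 0
    for required in required_order:
        try:
            position = names.index(required, position) + 1
        except ValueError:
            failures.append(f"missing or out-of-order tool: {required}")
            break
    return failures


# Example trace, shaped the way a harness might record it.
trace = [
    ToolCall("lookup_order", {"order_id": "A1"}),
    ToolCall("check_refund_policy", {}),
    ToolCall("issue_refund", {"order_id": "A1"}),
]
print(check_trace(trace, ["lookup_order", "issue_refund"],
                  frozenset({"delete_account"})))  # prints []
```

Asserting on behavior (which tools ran, in what order) tends to be more stable than asserting on the final wording, which is what the unit and quality classes cover.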

Pin the model and the prompt

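The eval suite is a measurement instrument. Pin the model ID to a dated snapshot, check prompts into git, and version the datasets. Without all three the suite drifts, and a claim that 'evals improved' becomes meaningless.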
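One way to make the pinning visible is to store a small manifest with every eval run and compare manifests before comparing scores. The sketch below is an assumption about how that might look: `make_manifest`, `changed_fields`, and the dataset label `"v3"` are hypothetical names chosen for illustration.

```python
import hashlib


def make_manifest(model_id: str, prompt_text: str, dataset_version: str) -> dict:
    """Record everything that defines the measurement setup for one run."""
    return {
        "model": model_id,
        "prompt_sha256": hashlib.sha256(prompt_text.encode("utf-8")).hexdigest(),
        "dataset": dataset_version,
    }


def changed_fields(before: dict, after: dict) -> list[str]:
    """List the manifest fields that differ between two runs."""
    return sorted(key for key in before if before[key] != after.get(key))


baseline = make_manifest("claude-haiku-4-5-20251001", "Reply with one word.", "v3")
candidate = make_manifest("claude-haiku-4-5-20251001", "Reply with exactly one word.", "v3")
print(changed_fields(baseline, candidate))  # ['prompt_sha256']
```

If more than the intended field changed between two runs (for example both the prompt and the model), a score difference cannot be attributed to either one.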

Run on every prompt change

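Treat prompts like code: every change goes through a PR, the eval suite runs in CI, and a regression blocks the merge. cwk-site uses this pattern for its content-generation prompts, and cwkPippa uses it for its persona system prompts. The discipline is what keeps changes safe.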
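The merge-blocking step can be as small as a script that compares the current scores with a stored baseline and returns a non-zero exit code on regression. This is a sketch under assumptions: the metric names, the `tolerance` default, and how the two score dictionaries are loaded are all hypothetical.

```python
import sys


def find_regressions(baseline: dict[str, float], current: dict[str, float],
                     tolerance: float = 0.02) -> list[str]:
    """Return the metrics whose score dropped by more than `tolerance`."""
    regressions = []
    for metric, old_score in sorted(baseline.items()):
        new_score = current.get(metric)
        if new_score is None:
            regressions.append(f"{metric}: missing from current run")
        elif new_score < old_score - tolerance:
            regressions.append(f"{metric}: {old_score:.2f} -> {new_score:.2f}")
    return regressions


def gate(baseline: dict[str, float], current: dict[str, float]) -> int:
    """Exit code for CI: 0 lets the merge through, 1 blocks it."""
    problems = find_regressions(baseline, current)
    for line in problems:
        print("REGRESSION", line, file=sys.stderr)
    return 1 if problems else 0
```

A CI job would end with something like `sys.exit(gate(baseline_scores, current_scores))`, with the job marked as a required check so a failing exit code blocks the merge.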

Principle: a prompt that lives in git but has no evals is not really versioned. Version both, gate both.

Code

Unit eval pattern with pytest · python

import pytest
from anthropic import Anthropic

client = Anthropic()
MODEL = "claude-haiku-4-5-20251001"  # date-pinned for reproducibility

CLASSIFY_SYSTEM = "Reply with exactly one word: positive, negative, or neutral."

@pytest.mark.parametrize("text,expected", [
    ("I love this", "positive"),
    ("this is awful", "negative"),
    ("meh", "neutral"),
])
def test_classify(text, expected):
    r = client.messages.create(
        model=MODEL,
        max_tokens=8,
        system=CLASSIFY_SYSTEM,
        messages=[{"role": "user", "content": text}],
    )
    assert r.content[0].text.strip().lower() == expected

LLM-as-judge for quality scoring · python

import json

from anthropic import Anthropic

client = Anthropic()

JUDGE_PROMPT = """
Evaluate the candidate answer against the expected answer for a customer
support reply. Score 1-5 on factual accuracy, tone, and completeness.
Reply with JSON: {"factual": int, "tone": int, "completeness": int, "reasoning": str}
"""

def judge(question: str, expected: str, candidate: str) -> dict:
    """Ask the judge model to score a candidate reply; return its parsed JSON."""
    r = client.messages.create(
        model="claude-sonnet-4-6",  # keep the judge on a stronger model
        max_tokens=512,
        system=JUDGE_PROMPT,
        messages=[{
            "role": "user",
            "content": f"Question:\n{question}\n\nExpected:\n{expected}\n\nCandidate:\n{candidate}",
        }],
    )
    # Raises json.JSONDecodeError if the judge replies with anything but bare JSON.
    return json.loads(r.content[0].text)
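
A judge's reply is model output too, so it may help to validate it before trusting the scores. The sketch below is one possible approach, not part of the lesson's code: `parse_judge_reply`, `passes`, and the pass threshold of 4 are assumptions. It tolerates a Markdown code fence around the JSON and rejects scores outside 1-5.

```python
import json

SCORE_KEYS = ("factual", "tone", "completeness")
FENCE = "`" * 3  # Markdown code-fence marker


def parse_judge_reply(raw: str) -> dict:
    """Parse the judge's JSON reply and validate the 1-5 scores."""
    text = raw.strip()
    # Some models wrap JSON in a Markdown code fence; strip it if present.
    if text.startswith(FENCE):
        text = text.strip("`")
        if text.startswith("json"):
            text = text[len("json"):]
    scores = json.loads(text)
    for key in SCORE_KEYS:
        value = scores.get(key)
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, int) or not 1 <= value <= 5:
            raise ValueError(f"invalid score for {key!r}: {value!r}")
    return scores


def passes(scores: dict, minimum: int = 4) -> bool:
    """A case passes only if every dimension meets the (assumed) minimum."""
    return all(scores[key] >= minimum for key in SCORE_KEYS)
```

Failing loudly on a malformed reply keeps a broken judge from silently reporting passing scores.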

Exercise

Pick one critical prompt in your project. Write 5 unit eval cases, 1 integration case, and 1 LLM-as-judge case. Wire them into CI as a required check that runs whenever that prompt file changes.

Hint

If 'CI cost is too high' is the blocker, run the quality evals nightly through the Batch API instead of on every PR.
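
A nightly run might look roughly like the sketch below, which builds request entries for the Anthropic Message Batches endpoint and submits them. The helper names and the `"input"` key on each case are hypothetical, and the batch is only submitted here; fetching and scoring the results would be a separate step once processing finishes.

```python
def build_batch_requests(cases: list[dict], model: str, system: str,
                         max_tokens: int = 512) -> list[dict]:
    """Turn eval cases into Message Batches request entries."""
    return [
        {
            "custom_id": f"case-{index}",  # used later to match results to cases
            "params": {
                "model": model,
                "max_tokens": max_tokens,
                "system": system,
                "messages": [{"role": "user", "content": case["input"]}],
            },
        }
        for index, case in enumerate(cases)
    ]


def submit_nightly(cases: list[dict], model: str, system: str) -> str:
    """Submit the batch and return its id for later result collection."""
    # Imported here so build_batch_requests can be tested without the SDK.
    from anthropic import Anthropic

    client = Anthropic()
    batch = client.messages.batches.create(
        requests=build_batch_requests(cases, model, system)
    )
    return batch.id
```

Keeping the request builder separate from the submission call means the cheap part can be covered by an ordinary unit test on every PR.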
