Treating Evals Like Code

~12 min · evaluation, engineering

Evals deserve the same hygiene as the system they test

Evals are software. They have versions, dependencies, their own tests (judge calibration), and a lifecycle. Treat them with the same engineering discipline as application code.
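
One way to give an eval its own test is to check the judge against a small set of human-labeled verdicts. The sketch below is only an illustration: the function name judge_agreement and the sample labels are hypothetical, and it assumes both the judge and the humans return boolean pass/fail verdicts. It reports raw agreement plus Cohen's kappa, which corrects for agreement that would happen by chance.

```python
def judge_agreement(judge_labels, human_labels):
    """Return (raw agreement, Cohen's kappa) for two lists of boolean verdicts."""
    if len(judge_labels) != len(human_labels) or not judge_labels:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(judge_labels)
    # Fraction of cases where judge and human gave the same verdict.
    observed = sum(j == h for j, h in zip(judge_labels, human_labels)) / n
    # Agreement expected by chance, from each rater's own pass rate.
    p_judge = sum(judge_labels) / n
    p_human = sum(human_labels) / n
    expected = p_judge * p_human + (1 - p_judge) * (1 - p_human)
    # If both raters always give the same single label, kappa is undefined; report 1.0.
    kappa = 1.0 if expected == 1 else (observed - expected) / (1 - expected)
    return observed, kappa


# Hypothetical calibration set: judge verdicts vs. human verdicts on 8 cases.
judge = [True, True, False, False, True, False, True, True]
human = [True, True, False, True, True, False, False, True]
observed, kappa = judge_agreement(judge, human)
print(f"agreement={observed:.2f} kappa={kappa:.2f}")
```

A calibration test can then assert that kappa stays above whatever threshold the team agrees on, so a change to the judge prompt that breaks its agreement with humans fails like any other test.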

Practical hygiene

Eval scripts live in source control, next to the prompts.
Golden sets are versioned (semantic version or date).
Evals run with a single command (pytest evals/) or a dedicated runner (Promptfoo, Langfuse).
Reports persist in a queryable form (dashboard, sheet, internal app).
Eval failures get triaged like test failures: flake, real regression, or eval bug.
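
As a rough sketch of what "queryable" and "versioned" can look like together, the example below stores one row per eval case in SQLite, tagged with the prompt version and golden set version, and leaves a column for the triage label. The table name, column names, and helper functions are all hypothetical; a dashboard or sheet would serve the same purpose.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS eval_results (
    run_id         TEXT NOT NULL,
    prompt_version TEXT NOT NULL,
    golden_version TEXT NOT NULL,
    case_id        TEXT NOT NULL,
    passed         INTEGER NOT NULL,  -- 1 = pass, 0 = fail
    triage         TEXT               -- NULL, 'flake', 'regression', or 'eval_bug'
)
"""


def record_results(conn, run_id, prompt_version, golden_version, results):
    """Store one row per case; results is an iterable of (case_id, passed)."""
    conn.execute(SCHEMA)
    conn.executemany(
        "INSERT INTO eval_results VALUES (?, ?, ?, ?, ?, NULL)",
        [(run_id, prompt_version, golden_version, cid, int(ok)) for cid, ok in results],
    )
    conn.commit()


def pass_rates(conn, golden_version):
    """Pass rate per prompt version, restricted to one golden set version."""
    rows = conn.execute(
        "SELECT prompt_version, AVG(passed) FROM eval_results "
        "WHERE golden_version = ? GROUP BY prompt_version ORDER BY prompt_version",
        (golden_version,),
    )
    return dict(rows.fetchall())


conn = sqlite3.connect(":memory:")  # use a file path to keep history across runs
record_results(conn, "run-1", "v7", "v3", [("c1", True), ("c2", True)])
record_results(conn, "run-2", "v8", "v3", [("c1", True), ("c2", False)])
rates = pass_rates(conn, "v3")
print(rates)
```

Filtering on the golden set version keeps comparisons honest: pass rates from different golden sets are not mixed, so a drop between prompt versions reflects the prompt and not a change in the test data.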

Cultural piece

If only one person on the team can run the evals, the evals are fragile. Make running them boring and easy. The payoff is that prompt iteration becomes data-driven instead of vibe-driven.

Code

Evals as a pytest suite · python

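# tests/test_support_prompt.py
import json
from pathlib import Path

import pytest
from jsonschema import validate  # assumed third-party dependency for schema checks
from evals.runner import call  # assumed project helper: call(prompt, text) -> str

def load_jsonl(path):
    """Read one golden case per non-empty line of a JSONL file."""
    return [json.loads(line) for line in Path(path).read_text().splitlines() if line.strip()]

# prompt_v8 is assumed to be a fixture defined in conftest.py.
@pytest.mark.parametrize("case", load_jsonl("goldens/support/v3.jsonl"))
def test_support_prompt(case, prompt_v8):
    out = call(prompt_v8, case["input"])
    assert all(s in out for s in case["must_contain"])
    assert not any(s in out for s in case["must_not_contain"])
    if "schema" in case:
        # The schema is a plain dict once loaded from JSONL; this assumes the output is JSON.
        validate(instance=json.loads(out), schema=case["schema"])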

Exercise

Wire your eval suite into pytest (or your own test runner). Add a CI step that runs it on every PR that touches a prompt. Document the one-line command in the README.
