Eval drift: when the test set lies

~12 min · evaluation, drift

Eval sets go stale

A golden set built in 2025 may not be representative of 2026 traffic. Categories shift, language changes, the user base grows, and policies get updated. A prompt that scores 95% on the old set can be measurably worse on actual current traffic.
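To make the gap concrete, here is a minimal sketch that re-weights per-category pass rates by a traffic distribution. The function name, the categories, and every number below are hypothetical and chosen only for illustration; the point is that the same per-category results give a different headline score once the category mix changes.

```python
def weighted_pass_rate(pass_rate_by_category, traffic_share):
    """Re-weight per-category pass rates by a traffic distribution.

    Returns (pass_rate, uncovered_share). Categories with no eval
    results cannot be scored, so their traffic share is reported
    separately instead of being silently counted as passing.
    """
    covered = {c: s for c, s in traffic_share.items() if c in pass_rate_by_category}
    uncovered = sum(s for c, s in traffic_share.items() if c not in pass_rate_by_category)
    covered_total = sum(covered.values())
    if covered_total == 0:
        return None, uncovered
    rate = sum(pass_rate_by_category[c] * s for c, s in covered.items()) / covered_total
    return rate, uncovered


# Hypothetical numbers, for illustration only.
pass_rates = {"billing": 0.98, "shipping": 0.96, "refunds": 0.80}
eval_share = {"billing": 0.60, "shipping": 0.30, "refunds": 0.10}
prod_share = {"billing": 0.30, "shipping": 0.20, "refunds": 0.40, "warranty": 0.10}

eval_rate, _ = weighted_pass_rate(pass_rates, eval_share)
prod_rate, unmeasured = weighted_pass_rate(pass_rates, prod_share)
print(f"golden-set mix: {eval_rate:.1%}")
print(f"production mix: {prod_rate:.1%} (plus {unmeasured:.0%} of traffic with no eval coverage)")
```

With these made-up inputs the golden-set mix scores about 95.6%, while the production mix scores about 89.6% on the covered categories, and a tenth of traffic is not measured at all.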

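Drift signals

With no prompt change, the golden set's pass rate drifts upward over time (the model is getting better at answering yesterday's questions).
The eval pass rate is stable, but the production complaint rate rises.
New intents appear in production that are not represented in the eval set.
Reference outputs are stale (a policy changed; the "correct" answer is now different).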
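The first three signals can be checked mechanically. The sketch below is one possible way to do it, not a prescribed method: the function names, the record shapes (a "category" key per case, a list of (prompt_version, pass_rate) pairs), and all thresholds are assumptions you would tune for your own data. Stale reference outputs still need a human review.

```python
def new_intents(eval_cases, prod_cases):
    """Intents seen in production that have no case in the eval set."""
    return {c["category"] for c in prod_cases} - {c["category"] for c in eval_cases}


def pass_rate_creep(history, min_gain=0.03):
    """True if the pass rate rose by min_gain with no prompt change.

    history: chronological list of (prompt_version, pass_rate).
    Only runs that used the latest prompt version are compared.
    """
    if not history:
        return False
    latest_version = history[-1][0]
    same_prompt = [rate for version, rate in history if version == latest_version]
    return len(same_prompt) >= 2 and same_prompt[-1] - same_prompt[0] >= min_gain


def complaints_outpace_eval(eval_rates, complaint_rates, eval_tol=0.01, rise=0.25):
    """True if the eval pass rate is flat while complaints rose by `rise` (relative)."""
    if len(eval_rates) < 2 or len(complaint_rates) < 2 or complaint_rates[0] <= 0:
        return False
    eval_flat = max(eval_rates) - min(eval_rates) <= eval_tol
    relative_rise = (complaint_rates[-1] - complaint_rates[0]) / complaint_rates[0]
    return eval_flat and relative_rise >= rise


# Hypothetical data, for illustration only.
eval_cases = [{"category": "billing"}, {"category": "shipping"}]
prod_cases = [{"category": "billing"}, {"category": "warranty"}]
history = [("v3", 0.90), ("v3", 0.92), ("v3", 0.94)]

print("new intents:", new_intents(eval_cases, prod_cases))
print("pass-rate creep:", pass_rate_creep(history))
print("complaints outpace eval:",
      complaints_outpace_eval([0.950, 0.951, 0.949], [0.020, 0.024, 0.031]))
```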

Maintenance routine

Every quarter: sample 100 production cases, classify them, and compare against the eval set distribution.
Add cases from new intents.
Retire cases that are no longer representative.
Re-review reference outputs to confirm they are still current.
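The routine above could be turned into a quarterly worklist along these lines. This is a sketch under several assumptions: each golden case is a dict with hypothetical "id", "category", and "reference_reviewed" (a date) fields, "no longer representative" is approximated as "category absent from the production sample", and the 180-day review window is arbitrary. The output is a list of candidates for a person to confirm, not an automatic decision.

```python
from collections import Counter
from datetime import date, timedelta


def maintenance_plan(golden_set, prod_sample, today, max_reference_age_days=180):
    """Build a worklist: intents to add, cases to retire, references to re-review."""
    prod_counts = Counter(c["category"] for c in prod_sample)
    eval_categories = {c["category"] for c in golden_set}

    # Intents that show up in production but have no golden case yet.
    add_from = sorted(set(prod_counts) - eval_categories)

    # Retirement candidates: golden cases whose category never appeared in
    # the sample. A small sample can miss rare categories, so review by hand.
    retire = [c["id"] for c in golden_set if prod_counts[c["category"]] == 0]

    # Reference outputs not reviewed within the window (skip retired cases).
    cutoff = today - timedelta(days=max_reference_age_days)
    review = [
        c["id"] for c in golden_set
        if c["reference_reviewed"] < cutoff and c["id"] not in retire
    ]
    return {"add_from": add_from, "retire": retire, "review": review}


# Hypothetical data, for illustration only.
golden = [
    {"id": "g1", "category": "billing", "reference_reviewed": date(2025, 1, 10)},
    {"id": "g2", "category": "fax_orders", "reference_reviewed": date(2025, 1, 10)},
    {"id": "g3", "category": "shipping", "reference_reviewed": date(2025, 11, 2)},
]
prod = [{"category": "billing"}, {"category": "shipping"}, {"category": "warranty"}]

plan = maintenance_plan(golden, prod, today=date(2026, 1, 15))
print(plan)
```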

Code

Drift detection sketch · python

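from collections import Counter

def intent_distribution(cases):
    """Count cases per intent category."""
    return Counter(c["category"] for c in cases)

# load_golden_set() and sample_production() are placeholders for your own data loaders.
eval_dist = intent_distribution(load_golden_set())
prod_dist = intent_distribution(sample_production(n=200))

# Compute totals once; fall back to 1 so an empty input does not divide by zero.
eval_total = sum(eval_dist.values()) or 1
prod_total = sum(prod_dist.values()) or 1

# Flag any category whose share differs by more than 5 percentage points.
# (A symmetric KL divergence would give a single summary score instead.)
for cat in sorted(set(eval_dist) | set(prod_dist)):
    e = eval_dist.get(cat, 0) / eval_total
    p = prod_dist.get(cat, 0) / prod_total
    if abs(e - p) > 0.05:
        print(f"drift: {cat} eval={e:.2%} prod={p:.2%}")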
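If you want a single summary number instead of per-category diffs, one option is the Jensen-Shannon divergence. Note that this is a substitution for the symmetric KL mentioned in the comment, not the same quantity: it is also symmetric, but it stays finite when a category is missing from one side, which is exactly the new-intent case. The function name is hypothetical, and any alert threshold is something you would have to calibrate on your own history.

```python
import math
from collections import Counter


def js_divergence(eval_counts, prod_counts):
    """Jensen-Shannon divergence (log base 2) between two count distributions.

    0.0 means identical category shares; 1.0 means no overlap at all.
    """
    e_total = sum(eval_counts.values())
    p_total = sum(prod_counts.values())
    if e_total == 0 or p_total == 0:
        raise ValueError("both distributions need at least one case")

    score = 0.0
    for cat in set(eval_counts) | set(prod_counts):
        e = eval_counts.get(cat, 0) / e_total
        p = prod_counts.get(cat, 0) / p_total
        m = (e + p) / 2  # mixture share; positive whenever e or p is positive
        if e > 0:
            score += 0.5 * e * math.log2(e / m)
        if p > 0:
            score += 0.5 * p * math.log2(p / m)
    return score


# Hypothetical counts, for illustration only.
eval_counts = Counter({"billing": 60, "shipping": 30, "refunds": 10})
prod_counts = Counter({"billing": 30, "shipping": 20, "refunds": 40, "warranty": 10})
print(f"JS divergence: {js_divergence(eval_counts, prod_counts):.3f}")
```

Tracking this score quarter over quarter gives a trend line; the per-category diff is still what tells you which categories to fix.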

External links

Anthropic — Maintain test sets

Exercise

Compare your golden set's category distribution with the last 30 days of production traffic. Identify the 3 categories that are most under-represented in the golden set and add cases for them.
