Metric design: don't measure what's easy

~14 min · evaluation, metrics

Metrics that are easy to compute are rarely the metrics that matter

BLEU, ROUGE, and exact-match accuracy were the standard metrics of pre-LLM NLP because they were cheap to compute. They capture meaning, helpfulness, and downstream impact poorly. The metric that matters is task success: did the user accomplish what they came to do?
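To make the failure mode concrete, here is a minimal sketch using exact match and a simplified unigram-overlap F1. The F1 is only in the spirit of ROUGE-1, not a reference implementation of ROUGE, and the example sentences are invented for illustration. A correct paraphrase scores low while a near-copy that flips the meaning scores high.

```python
from collections import Counter


def exact_match(pred: str, ref: str) -> bool:
    """Case-insensitive exact string match."""
    return pred.strip().lower() == ref.strip().lower()


def unigram_f1(pred: str, ref: str) -> float:
    """Simplified unigram-overlap F1 (whitespace tokens, no stemming)."""
    p, r = pred.lower().split(), ref.lower().split()
    overlap = sum((Counter(p) & Counter(r)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(r)
    return 2 * precision * recall / (precision + recall)


# Invented example sentences, not taken from any dataset.
reference = "the refund was issued on monday"
correct_paraphrase = "we sent your money back at the start of the week"
wrong_but_similar = "the refund was not issued on monday"

for name, pred in [("paraphrase", correct_paraphrase), ("negated", wrong_but_similar)]:
    print(name, exact_match(pred, reference), round(unigram_f1(pred, reference), 3))
```

Both candidates fail exact match, and the overlap score ranks the factually wrong answer far above the helpful one, which is the opposite of what a task-success metric would report.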

Metric quality tier

Cheap proxy: schema validity, length, required-field presence. Necessary but insufficient.
Local correctness: does the answer match the reference or cite the right source? Better.
Task success: did the downstream system or user accomplish the goal? Best.
Business outcome: did this prompt change move retention, NPS, or time-to-resolution? The hardest and slowest to measure, and the most valuable.

Use multiple tiers

Run cheap proxies in CI, local correctness on golden-set evals, task success during staged rollouts, and business outcomes for large investments. Don't try to compress them into a single number; each tier answers a different question.
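One way to keep the tiers separate is to tag each metric with its tier and let each stage pull only the metrics it owns. The sketch below is an assumption about how such a registry might look. The stage keys and the `metrics_for_stage` helper are hypothetical names, and the metric names are borrowed from this lesson's metric stack.

```python
from enum import Enum


class Tier(Enum):
    CHEAP_PROXY = 1
    LOCAL_CORRECTNESS = 2
    TASK_SUCCESS = 3
    BUSINESS_OUTCOME = 4


# Hypothetical stage names; each stage answers one tier's question.
STAGE_TIER = {
    "ci": Tier.CHEAP_PROXY,
    "golden_set_eval": Tier.LOCAL_CORRECTNESS,
    "staged_rollout": Tier.TASK_SUCCESS,
    "major_investment": Tier.BUSINESS_OUTCOME,
}

# Metric names mirror the lesson's metric stack; the tier tags are the lesson's.
METRIC_TIER = {
    "schema_valid": Tier.CHEAP_PROXY,
    "contains_required": Tier.CHEAP_PROXY,
    "hallucination_judge": Tier.LOCAL_CORRECTNESS,
    "task_success": Tier.TASK_SUCCESS,
    "resolved_without_human": Tier.BUSINESS_OUTCOME,
}


def metrics_for_stage(stage: str) -> list[str]:
    """Return the metric names a given stage should report, sorted by name."""
    tier = STAGE_TIER[stage]
    return sorted(name for name, t in METRIC_TIER.items() if t is tier)


for stage in STAGE_TIER:
    print(stage, metrics_for_stage(stage))
```

Reporting per stage rather than as a weighted sum means a regression in one tier cannot be hidden by an improvement in another.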

Code

Metric stack · python

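import json

# Placeholders to supply from your own stack: is_valid_json, llm_judge_unsupported,
# downstream_signal, ticket_resolved_solo, and REQUIRED (the required field names).
# `out` is the raw model output string, `src` the source text, `request_id` the request key.
parsed = json.loads(out) if is_valid_json(out) else {}
metrics = {
    "schema_valid": is_valid_json(out),                          # cheap proxy
    "contains_required": isinstance(parsed, dict)
        and all(k in parsed for k in REQUIRED),                  # cheap proxy: checks parsed keys, not substrings
    "hallucination_judge": llm_judge_unsupported(out, src),      # local correctness
    "task_success": downstream_signal(request_id),               # task success
    "resolved_without_human": ticket_resolved_solo(request_id),  # business outcome
}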
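The two cheap checks in the stack above can be written with the standard library alone. This is a minimal sketch, assuming the model output is a raw JSON string. `contains_required` and the field names in `REQUIRED` are hypothetical and chosen only for illustration. It also shows why the required-field check should run on parsed keys rather than on the raw string.

```python
import json


def is_valid_json(raw: str) -> bool:
    """True if `raw` parses as JSON."""
    try:
        json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        return False
    return True


def contains_required(raw: str, required: set[str]) -> bool:
    """True if `raw` is a JSON object whose top-level keys include `required`."""
    try:
        parsed = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        return False
    return isinstance(parsed, dict) and required.issubset(parsed)


REQUIRED = {"answer", "sources"}  # hypothetical field names
good = '{"answer": "42", "sources": ["doc-1"]}'
sneaky = '{"answer": "the sources field is missing"}'

print(contains_required(good, REQUIRED))
print(contains_required(sneaky, REQUIRED))
# A substring test on the raw string would wrongly accept `sneaky`:
print(all(k in sneaky for k in REQUIRED))
```

Even with this fix, both checks remain cheap proxies: they confirm the shape of the output, not whether the answer is right.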

Exercise

List the metrics you currently track for one prompt. Classify each by tier. Then determine whether the metrics you track most are also the ones most relevant to your decisions.
