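Meta-Evaluation and Building an Eval Culture

Does your evaluation actually measure quality?

The hardest question in evals: "Is the metric you are watching the metric you should be watching?" Meta-evaluation is the practice of evaluating your own evaluations.

Three meta-eval questions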

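Correlation with humans: Does the metric move when human-rated quality moves? Compute the correlation between automated scores and human ratings on the same set of outputs.
Correlation with users: Does the metric move when user-perceived quality moves? Compute the correlation between automated scores and downstream user signals (thumbs, retention, completion).
Robustness to gaming: When a model is optimized against the metric, does that optimization improve real quality, or only the score?

If the correlation is weak, the metric is theatre. Replace it.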
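For the user-signal question, one simple rank-based check is the probability that a thumbs-up response scores higher than a thumbs-down response (an AUC). The sketch below is one way to compute it with the standard library only. The function name `score_vs_thumbs_auc` and the sample data are illustrative assumptions, not part of any particular tool.

```python
from itertools import product

def score_vs_thumbs_auc(scores, thumbs_up):
    """Probability that a thumbs-up response outscores a thumbs-down one.

    0.5 means the metric carries no information about the user signal;
    1.0 means every thumbs-up response outscored every thumbs-down one.
    Ties count as half a win.
    """
    if len(scores) != len(thumbs_up):
        raise ValueError("scores and thumbs_up must be the same length")
    ups = [s for s, t in zip(scores, thumbs_up) if t]
    downs = [s for s, t in zip(scores, thumbs_up) if not t]
    if not ups or not downs:
        raise ValueError("need at least one thumbs-up and one thumbs-down")
    wins = sum(
        1.0 if u > d else 0.5 if u == d else 0.0
        for u, d in product(ups, downs)
    )
    return wins / (len(ups) * len(downs))

# Made-up illustrative data, not real measurements.
scores = [0.9, 0.8, 0.4, 0.7, 0.2, 0.6]
thumbs = [True, True, False, True, False, False]
print(f"AUC: {score_vs_thumbs_auc(scores, thumbs):.2f}")
```

A value near 0.5 on real traffic would suggest the metric does not track what users respond to, which is the "theatre" case described above.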

Building an eval culture: team practices

Make it easy: running the evals is a single command, not a 30-step ritual.
Make it visible: a dashboard in the team room. Eval scores quoted in PR descriptions.
Make it required: no PR merges without eval results. No release without a green eval check.
Make it shared: every team member contributes test cases. The dataset is communal.
Make it celebrated: when someone catches a regression with an eval, recognize it. The moment the suite first prevents an embarrassment is the moment the culture takes hold.

Principle: evals stick when the team feels them. A metric on a dashboard nobody opens is not a metric.
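The "required" practice can be sketched as a small gate that compares the current eval scores against a stored baseline and reports failures. This is a minimal sketch under assumptions: the metric names, the `max_drop` tolerance of 0.02, and the `gate` function itself are hypothetical, and a real setup would load both score sets from files produced by the eval runner.

```python
def gate(current, baseline, max_drop=0.02):
    """Return a list of failure messages; an empty list means the gate passes."""
    failures = []
    for name, base in baseline.items():
        if name not in current:
            failures.append(f"{name}: missing from current run")
        elif current[name] < base - max_drop:
            failures.append(
                f"{name}: {current[name]:.3f} is more than {max_drop} "
                f"below baseline {base:.3f}"
            )
    return failures

# Hypothetical scores for illustration only.
baseline = {"accuracy": 0.80, "faithfulness": 0.90}
current = {"accuracy": 0.81, "faithfulness": 0.85}

failures = gate(current, baseline)
for msg in failures:
    print("EVAL GATE FAIL:", msg)

# In CI, this value would be passed to sys.exit() so a failure blocks the merge.
exit_code = 1 if failures else 0
print("exit code:", exit_code)
```

Keeping the tolerance explicit makes the merge rule something the team can read and argue about, rather than a judgment call in each review.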

Production-stack eval pipeline

A mature stack: JSONL datasets in git → a Python or YAML eval runner → Braintrust or a self-hosted dashboard → a CI/CD gate → production sampling that feeds back into the dataset → a quarterly meta-eval against human ratings. Each piece is small; the value is in the loop.
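The first two stages of that loop (a JSONL dataset and a Python runner) can be sketched in a few lines. Everything named here is an assumption for illustration: the row fields `id`, `input`, and `expected`, the `exact_match` scorer, and the `fake_model` stand-in for a real model call. The dataset is inlined as a string so the sketch runs as is; in the stack described above it would be a `.jsonl` file in git.

```python
import json

# Hypothetical dataset rows; normally these live in a version-controlled file.
DATASET_JSONL = """\
{"id": "c1", "input": "2+2", "expected": "4"}
{"id": "c2", "input": "capital of France", "expected": "Paris"}
"""

def load_cases(text):
    """Parse one JSON object per non-empty line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]

def exact_match(output, expected):
    """Score 1.0 on a case-insensitive exact match, else 0.0."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0

def run_eval(cases, model_fn, scorer=exact_match):
    """Run every case through model_fn and aggregate the scores."""
    results = []
    for case in cases:
        output = model_fn(case["input"])
        results.append({"id": case["id"], "score": scorer(output, case["expected"])})
    mean = sum(r["score"] for r in results) / len(results) if results else 0.0
    return {"mean_score": mean, "results": results}

def fake_model(prompt):
    # Stand-in for a real model call; gets the second case wrong on purpose.
    return {"2+2": "4", "capital of France": "Lyon"}.get(prompt, "")

report = run_eval(load_cases(DATASET_JSONL), fake_model)
print(json.dumps(report, indent=2))
```

The report is plain JSON on purpose, so a dashboard or a CI gate can consume it without knowing anything about the runner.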

Code

Meta-eval: correlate a metric with human ratings·python

from scipy.stats import spearmanr

# 100 outputs, each with a pair: (automated_metric_score, human_rating_1_to_5).
# The two lists below are placeholders; fill them in, in the same output order.
automated = [...]   # list of floats from your metric
human = [...]       # list of integers from a human rater

rho, pval = spearmanr(automated, human)
print(f"Spearman correlation: {rho:.3f} (p={pval:.4f})")

# Rules of thumb:
# rho > 0.7 → metric tracks human judgment well
# rho 0.4-0.7 → useful but noisy; treat with care
# rho < 0.4 → metric is theatre; replace it

The end-to-end eval stack as a Makefile·text

Test data        +   Metrics       +   Runner          +   Dashboard       +   CI/CD
   |                  |                  |                    |                    |
   └── JSONL files    └── Python /       └── promptfoo /      └── Braintrust /     └── GitHub Actions
       in git             scoring             pytest /             custom              blocks bad PRs
                          functions           DeepEval /           webapp
                                              hand-rolled
