Regression Testing and the Eval-Improvement Cycle

The regression suite is the system's institutional memory

Every fixed bug qualifies as a regression test. Every reported failure qualifies as a case in the dataset. Over a few months, the regression suite becomes a record of every mistake the system has made, and a guarantee that none of them comes back silently.

Eval-improvement cycle

1. Eval flags a failure: the pre-release suite catches a regression, or a production sample shows a new failure mode.
2. The failure becomes a case: add it to the regression dataset with the right tags.
3. Fix the system: prompt, retrieval, model upgrade, or whatever the diagnosis points to.
4. Eval confirms the fix: the new case passes and no other case regresses.
5. Ship and monitor: production sampling watches for the same failure pattern to recur.
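Step 2 is the one that is easiest to skip, so here is a minimal sketch of what "the failure becomes a case" could look like in code. The `RegressionCase` fields and the `add_case` and `to_jsonl` helpers are hypothetical names chosen for illustration; they are not the `add_to_regression.py` script used elsewhere on this page, whose internals are not shown.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class RegressionCase:
    # Hypothetical schema: adapt the fields to whatever your eval harness stores.
    case_id: str
    input_text: str
    expected_behavior: str
    tags: list[str] = field(default_factory=list)
    reason: str = ""


def add_case(dataset: dict[str, RegressionCase], case: RegressionCase) -> bool:
    """Add a case keyed by id. Returns False if the id is already present."""
    if case.case_id in dataset:
        return False
    dataset[case.case_id] = case
    return True


def to_jsonl(dataset: dict[str, RegressionCase]) -> str:
    """Serialize the dataset as one JSON object per line."""
    return "\n".join(json.dumps(asdict(c), sort_keys=True) for c in dataset.values())


dataset: dict[str, RegressionCase] = {}
case = RegressionCase(
    case_id="qa.long_doc.045",
    input_text="(the prompt that triggered the failure)",
    expected_behavior="cite only documents that were actually retrieved",
    tags=["citation-hallucination"],
    reason="Added after a regression was spotted",
)
print(add_case(dataset, case))  # True
print(add_case(dataset, case))  # False: the same id is not added twice
```

Keying the dataset by case id keeps the add step idempotent, so re-reporting the same failure does not inflate the suite.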

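The regression suite never shrinks

It grows with every shipped fix. Do not retire a case because it has been an "easy pass" for six months: that is exactly the kind of case a model upgrade silently regresses. Retire a case only when the underlying behavior is no longer relevant, for example because the feature was removed.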
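The retirement rule can be written down as a tiny filter. This is a sketch under assumptions: the `feature` and `pass_streak` fields, the `split_retired` helper, and every case id except `qa.long_doc.045` are hypothetical. The point is only that the pass streak is never consulted.

```python
def split_retired(cases, removed_features):
    """Split cases into (active, retired).

    A case is retired only when its feature no longer exists. How long it
    has been passing is deliberately ignored.
    """
    active, retired = [], []
    for case in cases:
        if case["feature"] in removed_features:
            retired.append(case)
        else:
            active.append(case)
    return active, retired


cases = [
    {"id": "qa.long_doc.045", "feature": "citations", "pass_streak": 3},
    {"id": "qa.short.001", "feature": "citations", "pass_streak": 400},
    {"id": "export.csv.007", "feature": "legacy_export", "pass_streak": 12},
]
active, retired = split_retired(cases, removed_features={"legacy_export"})
print([c["id"] for c in active])   # ['qa.long_doc.045', 'qa.short.001']
print([c["id"] for c in retired])  # ['export.csv.007']
```

The case with a 400-run pass streak stays active; only the case tied to the removed feature leaves the suite.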

Principle: every fixed bug becomes a regression test. A dataset that grows with every release is the most valuable asset the team has.

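Diff against the baseline, not against an ideal

The right CI gate asks "did the pass rate drop relative to the previous release?", not "did the pass rate clear an absolute threshold?". An absolute threshold drifts: once the system sits well above it, it stops catching regressions. A diff against the baseline tightens naturally as the system improves.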
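A small numeric sketch shows why the two gates disagree. The numbers (a 0.90 threshold, a 0.96 baseline, a 0.91 candidate, a 0.02 tolerance) and the helper names `absolute_gate` and `baseline_gate` are illustrative assumptions, not measurements from a real system.

```python
def absolute_gate(pass_rate, threshold=0.90):
    """Pass if the rate clears a fixed bar, regardless of history."""
    return pass_rate >= threshold


def baseline_gate(pass_rate, baseline_rate, max_drop=0.02):
    """Pass if the rate has not fallen more than max_drop below the baseline."""
    return pass_rate - baseline_rate >= -max_drop


baseline_rate = 0.96   # previous release
candidate_rate = 0.91  # this release: a five-point drop

print(absolute_gate(candidate_rate))                 # True: still above 0.90
print(baseline_gate(candidate_rate, baseline_rate))  # False: dropped more than 0.02
```

The fixed bar waves through a five-point drop because it was set when the system was weaker; the baseline comparison moves up with each release and catches it.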

Code

The eval-improvement cycle as a script·bash

# 1. Eval flags failure on PR
$ just eval-pr
  qa.long_doc.045  FAIL  judge: 'response cited a doc that wasn't retrieved'

# 2. Add the case to the regression set with a tag
$ python scripts/add_to_regression.py \
    --id qa.long_doc.045 \
    --tag 'citation-hallucination' \
    --reason 'Added after regression spotted on 2026-04-15'

# 3. Fix the prompt / retrieval / model / etc.
$ vim prompts/citation.txt

# 4. Re-run eval. The new case must pass; nothing else may regress.
$ just eval-pr
  qa.long_doc.045  PASS
  pass_rate: 0.94 → 0.96  ✓

# 5. Ship.
$ just deploy

Diff-against-baseline gate·python

def gate(current_results, baseline_results, max_drop=0.02):
    """Pass unless the pass rate fell more than max_drop below the baseline."""
    if not current_results or not baseline_results:
        raise ValueError("both result sets must be non-empty")
    cur_rate = sum(r["pass"] for r in current_results) / len(current_results)
    base_rate = sum(r["pass"] for r in baseline_results) / len(baseline_results)
    diff = cur_rate - base_rate
    if diff < -max_drop:
        # Regressions are cases that passed in the baseline but fail now.
        base_passed = {b["id"] for b in baseline_results if b["pass"]}
        regressed = [r["id"] for r in current_results
                     if not r["pass"] and r["id"] in base_passed]
        return False, f"pass-rate dropped {-diff:.1%}; specifically: {regressed[:5]}"
    return True, f"pass-rate diff {diff:+.1%} within tolerance"
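The gate above compares aggregate pass rates, so a single regressed case can hide behind newly passing ones. Step 4 of the cycle asks for something stricter: no other case may regress. Below is a sketch of a per-case check, assuming the same `{"id": ..., "pass": ...}` result shape as the gate; `find_regressions` is a hypothetical helper, not part of any existing tool.

```python
def find_regressions(current_results, baseline_results):
    """Return (regressed, missing) ids relative to the baseline.

    regressed: passed in the baseline, fails now.
    missing:   passed in the baseline, absent from the current run
               (the suite should never silently shrink).
    """
    current_by_id = {r["id"]: r["pass"] for r in current_results}
    regressed, missing = [], []
    for b in baseline_results:
        if not b["pass"]:
            continue  # a case that already failed cannot regress
        if b["id"] not in current_by_id:
            missing.append(b["id"])
        elif not current_by_id[b["id"]]:
            regressed.append(b["id"])
    return regressed, missing


baseline = [
    {"id": "a", "pass": True},
    {"id": "b", "pass": True},
    {"id": "c", "pass": False},
]
current = [
    {"id": "a", "pass": True},
    {"id": "b", "pass": False},  # regressed
    {"id": "c", "pass": True},   # newly fixed
    {"id": "d", "pass": True},   # newly added case
]
print(find_regressions(current, baseline))  # (['b'], [])
```

In this example the aggregate pass rate rises from 2/3 to 3/4, so a rate-only gate would pass even though case `b` regressed. Running both checks together is one way to cover both failure shapes.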
