Evals in the AI workflow

Where evals plug in

An eval that only runs when a researcher remembers to run it is not an eval. The infrastructure question is: at which moments in the development loop does the suite fire?

Local development. A fast smoke eval (5-20 cases) runs in seconds, before the engineer pushes. just eval-fast.
Pull request. The full deterministic eval (100-500 cases) runs as a CI step. The PR cannot merge if the pass rate falls below the threshold.
Pre-release. The full eval, including paid LLM judges and human-rated cases. It gates the release tag.
Scheduled (nightly). Long-running suites over the full dataset, plus drift checks against production logs.
Online (production). A sampling layer captures live traffic, scores it, alerts on regressions, and feeds new edge cases back into the offline dataset.
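The offline tiers above can be written down as data, so that every trigger picks its suite from one place. The following is a minimal sketch, not a prescribed layout: the `EvalTier` class, the `TIERS` table, the `needs_judge` flag on each case, and the choice to keep paid judges out of the nightly run are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class EvalTier:
    name: str
    max_cases: Optional[int]  # None means "run the whole dataset"
    uses_paid_judge: bool     # whether judge-scored cases are included
    gates: Optional[str]      # what a failure blocks, if anything


# Hypothetical tier table mirroring the offline stages described above.
TIERS = {
    "local": EvalTier("local", 20, False, None),
    "pull_request": EvalTier("pull_request", 500, False, "merge"),
    "pre_release": EvalTier("pre_release", None, True, "release tag"),
    "nightly": EvalTier("nightly", None, False, None),
}


def select_cases(cases, tier_name):
    """Return the cases a tier should run, in dataset order."""
    tier = TIERS[tier_name]
    # Tiers without a paid judge only run deterministic cases.
    eligible = [
        c for c in cases
        if tier.uses_paid_judge or not c.get("needs_judge", False)
    ]
    if tier.max_cases is None:
        return eligible
    return eligible[: tier.max_cases]
```

A runner invoked from a local command, a CI step, or a nightly job would then call `select_cases(dataset, "pull_request")` and so on, which keeps the tier definitions from drifting apart across scripts.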

PR-level evals have the highest leverage

If you do only one thing, make the eval a PR gate. Most regressions are introduced one prompt edit at a time, and most survive because nobody reran the suite. A green eval check on every PR catches them at the cheapest possible moment.

Principle: the earlier in the workflow an eval fires, the cheaper the regression is to fix. local < PR < pre-release < production.

Production telemetry closes the loop

Sampling 1-5% of production traffic into an online eval pipeline is the single biggest force multiplier for offline evals. Everything you discover in the wild becomes tomorrow's regression test. Tools such as Braintrust and Inspect AI specifically support this loop.

Code

GitHub Actions: evals as a PR gate·yaml

# .github/workflows/evals.yml
name: Evals
on: [pull_request]
jobs:
  fast-eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with: { fetch-depth: 0 }  # full history, so origin/main is available to the baseline comparison
      - uses: actions/setup-python@v5
        with: { python-version: '3.12' }
      - run: pip install -r requirements.txt
      - name: Run fast eval suite
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
        run: just eval-fast
      - name: Compare to baseline
        # Fail the PR if the pass rate dropped by more than 2 percentage points vs. origin/main.
        run: python scripts/eval_diff.py --base origin/main --threshold 0.02
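The workflow calls scripts/eval_diff.py without showing it. Below is a rough sketch of the comparison such a script might perform, assuming each eval run yields a list of records with a case `id` and a boolean `passed` field. The function names and the record shape are hypothetical, and loading results for the base branch is left out.

```python
def pass_rate(results):
    """Fraction of cases that passed; results is a list of dicts with 'passed'."""
    if not results:
        raise ValueError("no eval results to compare")
    return sum(1 for r in results if r["passed"]) / len(results)


def compare_to_baseline(base_results, head_results, threshold=0.02):
    """Return (ok, drop, newly_failing) for a PR run against its base branch.

    ok is False when the pass rate dropped by more than `threshold`.
    """
    drop = pass_rate(base_results) - pass_rate(head_results)
    base_passed = {r["id"] for r in base_results if r["passed"]}
    # Cases that passed on the base branch but fail on the PR head.
    newly_failing = sorted(
        r["id"] for r in head_results
        if not r["passed"] and r["id"] in base_passed
    )
    # Small epsilon so float rounding does not fail a run sitting exactly on the threshold.
    ok = drop <= threshold + 1e-9
    return ok, drop, newly_failing
```

A CI wrapper would exit non-zero when `ok` is False and print `newly_failing`, so the PR author sees which cases regressed rather than only an aggregate number.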

Production sampler: feeding the offline dataset·python

import random
import time

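# eval_queue is assumed to be a message-queue client configured elsewhere
# in the application, exposing publish(record).
def log_for_eval(request, response):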
    """Sample 2% of production calls into the eval queue."""
    if random.random() < 0.02:
        record = {
            "timestamp": time.time(),
            "input": request["input"],
            "output": response["text"],
            "model": response["model"],
            "latency_ms": response["latency_ms"],
        }
        eval_queue.publish(record)

# Downstream worker scores each record async, alerts on anomalies,
# and adds the most informative ones to the offline regression set.
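The downstream worker mentioned in the comment could look roughly like this. It is a synchronous sketch for readability (a real worker would consume from the queue asynchronously), and it assumes that "most informative" simply means "scored below a cutoff". The `score_and_triage` name, the `scorer` and `alert` callables, and the `min_score` cutoff are all hypothetical.

```python
def score_and_triage(record, scorer, regression_set, alert, min_score=0.5):
    """Score one sampled record; alert on it and keep it if it scores low.

    scorer(input, output) -> float in [0, 1]
    alert(message)        -> any side effect (log, pager, chat message)
    regression_set        -> a list standing in for the offline dataset
    """
    score = scorer(record["input"], record["output"])
    if score < min_score:
        alert(f"low eval score {score:.2f} for model {record['model']}")
        # Promote the failing example so the offline suite covers it from now on.
        regression_set.append({
            "input": record["input"],
            "output": record["output"],
            "score": score,
            "source": "production",
        })
    return score


def non_empty_scorer(_input, output):
    """Toy deterministic scorer: an empty answer counts as a failure."""
    return 1.0 if output.strip() else 0.0
```

Swapping `non_empty_scorer` for an LLM judge or a task-specific check changes only the `scorer` argument. Promoted cases would normally be reviewed by a human before they become permanent regression tests.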
