Managing Judge Call Costs

Judge calls add up fast

A 500-case eval with a strong judge model costs $20-100 per run. Run it on every PR and you are paying $1,000-5,000 a month for evaluation. That is manageable, but worth optimizing.
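The arithmetic behind figures like these can be sketched as follows. The token counts, per-token prices, and the number of runs per month are illustrative assumptions, not any provider's actual price list.

```python
def run_cost(cases, in_tokens, out_tokens, usd_per_m_in, usd_per_m_out):
    """Cost in USD of one eval run, given per-case token counts
    and prices per million tokens."""
    per_case = (in_tokens * usd_per_m_in + out_tokens * usd_per_m_out) / 1_000_000
    return cases * per_case

# Assumed: 500 cases, 3,000 input and 500 output tokens per judge call,
# priced at $15 / $75 per million tokens (hypothetical rates).
per_run = run_cost(500, 3_000, 500, 15.0, 75.0)
monthly = per_run * 50  # assumed 50 PR-triggered runs per month
print(f"per run: ${per_run:.2f}, per month: ${monthly:.2f}")
```

With these assumed inputs one run comes to about $41 and a month to about $2,060, which sits inside the ranges quoted above. Swapping in your own token counts and prices shows which of the two dominates.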

Six techniques for cutting cost

Cache by hash: same input + same prompt + same judge model = same answer. Cache the result. Re-running an unchanged case then costs nothing.
Cheap judge first, expensive judge later: use a cheap model for the initial verdict, and escalate to the expensive judge only when the cheap one fails the case or is uncertain.
Sample, don't exhaust: for online eval on production traffic, judge 1-5% of calls, not all of them.
Batch when possible: judge multiple cases per request when the API supports it (Anthropic batch API, OpenAI batch endpoint).
Prefer structured-output mode: it removes the retry loop for malformed JSON, saving the roughly 5-15% of spend that would otherwise go to failed calls.
Downsize the model on trivial axes: format-compliance and length checks do not need a frontier model. Use Haiku / GPT-mini / a small open model for those.
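The sampling technique in the list can be sketched with a deterministic hash of a request ID, so that the same request is always either in or out of the sample. The `request_id` format and the 2% default rate are assumptions made for illustration.

```python
import hashlib

def should_judge(request_id: str, rate: float = 0.02) -> bool:
    """Return True for roughly `rate` of request IDs, deterministically."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    # Map the hash to one of 10,000 buckets and keep the lowest `rate` share.
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return bucket < rate * 10_000

# Hypothetical request IDs; in production these would come from your logs.
sampled = [rid for rid in (f"req-{i}" for i in range(10_000)) if should_judge(rid)]
print(len(sampled))  # close to 200 at a 2% rate
```

Hashing rather than calling a random number generator keeps the sample stable across retries and replays, which makes a judged trace easy to find again.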

Principle: match judge model strength to question difficulty. Use a frontier model for subtle quality judgments and a small model for plumbing checks. Spending Opus on a regex-equivalent question is just burning money.
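One way to apply this principle is a plain lookup from eval axis to judge tier, with checks that need no model written as ordinary code. The axis names and the `small-model` / `frontier-model` tier labels below are hypothetical placeholders, not real model IDs.

```python
# Hypothetical tier labels; substitute the model IDs your provider offers.
JUDGE_FOR_AXIS = {
    "format_compliance": "small-model",
    "length": "small-model",
    "factuality": "frontier-model",
    "helpfulness": "frontier-model",
}

def pick_judge(axis: str, default: str = "frontier-model") -> str:
    """Return the judge tier for an axis; unknown axes get the strong tier."""
    return JUDGE_FOR_AXIS.get(axis, default)

def length_ok(output: str, max_chars: int = 2000) -> bool:
    """A length limit needs no model at all: it is plain code."""
    return len(output) <= max_chars

print(pick_judge("length"), pick_judge("factuality"))
```

Defaulting unknown axes to the strong tier is a conservative choice in this sketch: a new axis costs more until someone decides it is safe to downgrade.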

Track cost per eval run

Most LLM frameworks expose token usage. Log it. A 'cost per eval run' metric should appear on your eval dashboard. When a refactor doubles eval cost, someone should notice immediately, not when the next billing report arrives.
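A minimal sketch of such a metric: accumulate token usage per judge call and convert it to dollars at the end of the run. The `PRICES` table, the tier names, and the `record` signature are assumptions; real token counts would come from whatever usage fields your framework reports.

```python
from dataclasses import dataclass, field

# Assumed prices in USD per million tokens: (input, output).
PRICES = {"small-model": (1.0, 5.0), "frontier-model": (15.0, 75.0)}

@dataclass
class RunCost:
    calls: int = 0
    usd: float = 0.0
    by_model: dict = field(default_factory=dict)

    def record(self, model: str, input_tokens: int, output_tokens: int) -> None:
        """Add one judge call's token usage to the running total."""
        p_in, p_out = PRICES[model]
        cost = (input_tokens * p_in + output_tokens * p_out) / 1_000_000
        self.calls += 1
        self.usd += cost
        self.by_model[model] = self.by_model.get(model, 0.0) + cost

run = RunCost()
run.record("frontier-model", input_tokens=3_000, output_tokens=500)
run.record("small-model", input_tokens=3_000, output_tokens=500)
print(f"{run.calls} calls, ${run.usd:.4f} total")
```

Emitting `run.usd` and `run.by_model` alongside the pass rate for every run is enough to make a sudden cost jump visible on the same dashboard as the quality numbers.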

Code

Hash-keyed judge cache·python

import hashlib, json, sqlite3
from pathlib import Path

DB = Path(".eval_cache.sqlite")
conn = sqlite3.connect(DB)
conn.execute("CREATE TABLE IF NOT EXISTS cache (k TEXT PRIMARY KEY, v TEXT, model TEXT, created_at INTEGER)")

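def cache_key(prompt, model):
    # The prompt is assumed to be fully rendered (template + case input),
    # so a change to either one produces a new key.
    h = hashlib.sha256()
    h.update(prompt.encode("utf-8"))
    h.update(b"\x00")  # separator: ("ab", "c") and ("a", "bc") must not collide
    h.update(model.encode("utf-8"))
    return h.hexdigest()

def cached_judge(prompt, model_id, judge_fn):
    k = cache_key(prompt, model_id)
    row = conn.execute("SELECT v FROM cache WHERE k=?", (k,)).fetchone()
    if row:
        return json.loads(row[0])  # cache hit: no judge call, no cost
    out = judge_fn(prompt, model_id)  # cache miss: pay for one judge call
    conn.execute("INSERT OR REPLACE INTO cache VALUES (?,?,?,strftime('%s','now'))",
                 (k, json.dumps(out), model_id))
    conn.commit()
    return out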

Two-tier judge: cheap first, escalate to expensive·python

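def two_tier_judge(case, output, cheap, expensive):
    """Accept a confident cheap PASS; escalate a FAIL or an uncertain verdict."""
    cheap_v = cheap(case, output)
    # A verdict with no confidence field is treated as uncertain.
    if cheap_v["verdict"] == "PASS" and cheap_v.get("confidence", 0.0) >= 0.9:
        return cheap_v  # cheap was confident; stop
    return expensive(case, output)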

# In practice: 70-80% of cases settle at the cheap tier.
# Total cost for the suite drops 60-70%, with negligible quality loss
# because the expensive judge sees the hard cases.
