Benchmark Limitations and Custom Benchmarks

Why public benchmarks are not enough

Every public benchmark shares the same fatal flaw for product evaluation: it is public. That means models are evaluated (and often trained) on the very questions you want to use to compare them. Three structural problems follow:

Contamination: the model has already seen the test items during training.
Goodhart: once a benchmark is well known, model labs optimize for it specifically.
Distribution mismatch: your product's input distribution usually looks nothing like the benchmark's.
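
Distribution mismatch can be made concrete by comparing the mix of categories in a benchmark against the mix in your own traffic. The sketch below uses total variation distance over intent labels; the function names and the sample labels are hypothetical, chosen only for illustration, and a real comparison would use labels from your own logs.

```python
from collections import Counter

def category_shares(labels):
    """Return the share of each category in a non-empty list of labels."""
    if not labels:
        raise ValueError("labels must not be empty")
    counts = Counter(labels)
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}

def total_variation_distance(labels_a, labels_b):
    """0.0 means identical category mixes; 1.0 means no overlap at all."""
    p, q = category_shares(labels_a), category_shares(labels_b)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in set(p) | set(q))

# Hypothetical intent labels, for illustration only.
benchmark_intents = ["qa"] * 90 + ["exploratory"] * 10
product_intents = ["qa"] * 30 + ["navigation"] * 30 + ["transactional"] * 40

tvd = total_variation_distance(benchmark_intents, product_intents)
print(f"total variation distance: {tvd:.2f}")
```

A value near 1.0 suggests the benchmark exercises a very different mix of requests than the product sees, so its scores say little about product quality.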

Why custom benchmarks always win in production

The benchmark that matters for your product is the product's own eval suite. Public benchmarks help filter out clearly weaker models; custom benchmarks discriminate among the survivors.

How to build a defensible custom benchmark

Sample real production traffic (with privacy controls).
Stratify across the dimensions that matter (intent, language, length, difficulty).
Curate 500-1000 cases with reference answers or rubric grading.
Hold out 100-200 cases that never touch any model fine-tuning.
Version it like code; refresh quarterly.
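
The sampling, stratification, and holdout steps above can be sketched as follows. This is a minimal illustration, not a prescribed pipeline: the record fields (`id`, `intent`, `language`), the function names, and the synthetic `logs` list are all assumptions, and real logs would need privacy filtering before they get this far.

```python
import random
from collections import defaultdict

def stratified_sample(cases, keys, per_stratum, seed=0):
    """Pick up to per_stratum cases from each combination of the given keys."""
    rng = random.Random(seed)
    buckets = defaultdict(list)
    for case in cases:
        buckets[tuple(case[k] for k in keys)].append(case)
    sample = []
    for stratum in sorted(buckets):
        members = buckets[stratum]
        rng.shuffle(members)
        sample.extend(members[:per_stratum])
    return sample

def split_holdout(cases, holdout_count, seed=0):
    """Return (dev, holdout); the holdout must never reach a fine-tuning corpus."""
    rng = random.Random(seed)
    shuffled = list(cases)
    rng.shuffle(shuffled)
    return shuffled[holdout_count:], shuffled[:holdout_count]

# Synthetic stand-in for privacy-scrubbed production logs (hypothetical data).
logs = []
for intent in ["qa", "navigation"]:
    for lang in ["en", "ja"]:
        for _ in range(50):
            logs.append({"id": len(logs), "intent": intent, "language": lang})

suite = stratified_sample(logs, keys=["intent", "language"], per_stratum=10)
dev, holdout = split_holdout(suite, holdout_count=8)
print(len(suite), len(dev), len(holdout))
```

Fixing the seed keeps the split reproducible, which fits the advice to version the benchmark like code.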

Principle: public benchmarks tell you which models are in the race; custom benchmarks tell you which one to ship.

The future-proofing argument

When the next model arrives ("Claude X.Y", "GPT-N+1"), the custom benchmark is the only artifact that lets you decide whether to switch. Without it, every model upgrade is a leap of faith. With it, an upgrade becomes a numerical decision: run the suite, beat the baseline, ship.
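
One way to turn "run the suite, beat the baseline, ship" into a check is sketched below. It assumes each model has already been scored per case on the same suite; `should_upgrade`, the `min_gain` threshold, and the example scores are hypothetical names and values used only for illustration.

```python
def should_upgrade(baseline_scores, candidate_scores, min_gain=0.0):
    """Compare per-case scores keyed by case ID and summarize the decision."""
    if not baseline_scores:
        raise ValueError("need at least one scored case")
    if baseline_scores.keys() != candidate_scores.keys():
        raise ValueError("both models must be scored on the same cases")
    n = len(baseline_scores)
    baseline_mean = sum(baseline_scores.values()) / n
    candidate_mean = sum(candidate_scores.values()) / n
    # Cases where the candidate did worse deserve a manual look before shipping.
    regressions = sorted(
        cid for cid in baseline_scores
        if candidate_scores[cid] < baseline_scores[cid]
    )
    return {
        "baseline_mean": baseline_mean,
        "candidate_mean": candidate_mean,
        "regressions": regressions,
        "ship": candidate_mean - baseline_mean > min_gain,
    }

# Hypothetical per-case scores (1.0 = pass, 0.0 = fail).
baseline = {"c1": 1.0, "c2": 0.0, "c3": 1.0, "c4": 0.0}
candidate = {"c1": 1.0, "c2": 1.0, "c3": 0.0, "c4": 1.0}

decision = should_upgrade(baseline, candidate, min_gain=0.1)
print(decision)
```

Listing regressions alongside the mean matters because an aggregate gain can hide a drop on a stratum the product depends on.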

Code

Custom benchmark structure·json

{
  "benchmark": "acme-product-v3",
  "version": "2026-q2",
  "description": "500 cases sampled from production logs, stratified by intent and language",
  "strata": {
    "intent": ["qa", "navigation", "transactional", "exploratory"],
    "language": ["en", "es", "fr", "ja"],
    "difficulty": ["easy", "medium", "hard"]
  },
  "holdout_count": 100,
  "refresh_cadence": "quarterly",
  "contamination_policy": "no case ever appears in any fine-tuning corpus"
}

Detecting public-benchmark contamination·python

# A simple test: paraphrase a benchmark question and ask the model both versions.
# If it answers the original near-perfectly but the paraphrase poorly,
# memorization is the likely explanation.
import difflib

def paraphrase_test(model, original_q, paraphrased_q, expected):
    # `model` is any object exposing a complete(prompt) -> str method.
    a1 = model.complete(original_q)
    a2 = model.complete(paraphrased_q)
    # Lowercase both sides so the substring check is case-insensitive.
    needle = expected.lower()
    return {
        "original_correct": needle in a1.lower(),
        "paraphrase_correct": needle in a2.lower(),
        # Character-level similarity between the two answers, in [0, 1].
        "answer_similarity": difflib.SequenceMatcher(None, a1, a2).ratio(),
    }
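
A single question proves little, so the per-question results are more useful when aggregated over many items. The sketch below assumes a list of dicts shaped like the return value of `paraphrase_test`; `contamination_gap` and the sample results are hypothetical and for illustration only.

```python
def contamination_gap(results):
    """Accuracy on original questions minus accuracy on their paraphrases."""
    if not results:
        raise ValueError("need at least one result")
    n = len(results)
    original_acc = sum(r["original_correct"] for r in results) / n
    paraphrase_acc = sum(r["paraphrase_correct"] for r in results) / n
    return original_acc - paraphrase_acc

# Hypothetical results, shaped like the output of paraphrase_test.
results = [
    {"original_correct": True, "paraphrase_correct": True},
    {"original_correct": True, "paraphrase_correct": False},
    {"original_correct": True, "paraphrase_correct": False},
    {"original_correct": False, "paraphrase_correct": False},
]

gap = contamination_gap(results)
print(f"original-vs-paraphrase accuracy gap: {gap:.2f}")
```

A large positive gap is consistent with memorization, though it can also reflect paraphrases that are harder or more ambiguous than the originals, so treat it as a signal to investigate rather than proof.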
