Golden Dataset

Golden dataset 이 ground truth

'Golden' dataset 은 각 input 이 인간들 — domain expert, annotator, product owner — 에게 축복받은 reference output 과 짝지어진 거야. grader 가 비교하는 reference frame 이야. 만드는 건 느리고, 비싸고, 대체 불가능해.

Dataset 을 'golden' 으로 만드는 것

검증된 reference — 적어도 두 명의 독립적 인간이 옳은 output 에 동의함. 단일 annotator reference 는 한 사람의 편향 상속.
Edge case coverage — easy case 만 X. long input, adversarial input, cross-language input, format-shifting input.
Versioning — dataset 이 git (또는 dataset platform) 에 check in 돼서 각 case 가 언제, 왜, 누구에 의해 변경됐는지 알 수 있음.
Provenance — 각 case 별로 어디서 왔는지 (production log, expert, synthetic) 알고 포함을 defend 할 수 있음.

원칙: Golden dataset 은 만드는 가장 비싼 것이고 나중에 다시 살 수 없는 유일한 eval infrastructure 야. database 처럼 다뤄, scratch file 처럼 X.

Inter-annotator agreement 가 중요해

두 annotator 가 30% case 에서 'correct' 답에 disagree 하면 golden dataset 은 golden 이 아니야 — 인간 disagreement 의 측정이야. Cohen's kappa 또는 Krippendorff's alpha 계산해. agreement 가 ~0.7 아래면 task 정의가 너무 모호해. labeling 계속 전에 rubric 다시 써.

Leakage 조심해

golden set 이 evaluate 하는 model fine-tune 에 쓰였다면 eval 은 무의미해 — model 이 답 외워. 항상 dataset 의 일부분을 training 에 절대 안 닿는 hold out 으로 유지해. 팀들이 dataset 을 무심코 돌릴 때 이게 들리는 것보다 어려워.

Code

Golden dataset entry — input + reference 너머·json

{
  "id": "qa.012",
  "input": "What was the population of Tokyo in 2024?",
  "reference": "Approximately 14 million in the prefecture; 37 million in the wider metropolitan area.",
  "acceptable_alternatives": [
    "~14 million in the city, ~37 million metro",
    "14 million people lived in Tokyo in 2024"
  ],
  "tags": ["qa", "factual", "date-sensitive"],
  "annotators": ["alice", "bob"],
  "agreement": 1.0,
  "source": "production_log",
  "provenance": "sampled 2026-04-15, anonymized",
  "version": 3
}

Inter-annotator agreement check (Cohen's kappa)·python

from sklearn.metrics import cohen_kappa_score

# Two annotators rate 100 outputs on a binary 'correct or not'
alice = [1, 0, 1, 1, 0, 1, ...]  # length 100
bob   = [1, 0, 1, 0, 0, 1, ...]

kappa = cohen_kappa_score(alice, bob)
print(f"Cohen's kappa = {kappa:.3f}")
# < 0.4 — poor agreement, your task definition is fuzzy
# 0.4-0.6 — moderate, worth tightening rubric
# 0.6-0.8 — substantial, reasonable to label more
# > 0.8 — almost perfect, ready for production

Golden dataset 이 ground truth

Dataset 을 'golden' 으로 만드는 것

Inter-annotator agreement 가 중요해

Leakage 조심해

Code

External links

Exercise

Progress

댓글 0