Golden sets: building and maintaining them

~16 min · evaluation, golden-sets


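Curated, not collected

A golden set is a hand-curated collection of examples, each paired with a known-correct output. It is the smallest, most carefully chosen subset of your traffic, and you use it to validate every change. For most prompts, 50 cases are enough; high-stakes systems need 200–500.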

Construction

Source: sample real traffic, stratified by intent or category.
Reference output: produced by your best human reviewer or your current best model.
Annotation: why this output is correct, and what alternative correct outputs would look like.
Versioning: the golden set evolves, so tag each version and run new prompts against a specific tagged build of the set.
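The sourcing step can be sketched as a proportional stratified sample. This is a minimal illustration, assuming each traffic record is a dict with a `category` field; the function name `stratified_sample` and its parameters are hypothetical, not part of any library.

```python
import random
from collections import defaultdict


def stratified_sample(records, total, key="category", seed=0):
    """Sample roughly `total` records, keeping each category's share of traffic.

    Every category gets at least one slot, so rare intents are never dropped.
    Because of rounding, the result can differ from `total` by a few records.
    """
    rng = random.Random(seed)  # fixed seed so the sample is reproducible

    # Group records by their category.
    by_category = defaultdict(list)
    for record in records:
        by_category[record[key]].append(record)

    sample = []
    for category, items in sorted(by_category.items()):
        # Proportional quota, at least 1, never more than what is available.
        quota = max(1, round(total * len(items) / len(records)))
        quota = min(quota, len(items))
        sample.extend(rng.sample(items, quota))
    return sample


# Hypothetical traffic: 80 refund, 15 shipping, 5 billing requests.
traffic = (
    [{"category": "refund", "input": f"r{i}"} for i in range(80)]
    + [{"category": "shipping", "input": f"s{i}"} for i in range(15)]
    + [{"category": "billing", "input": f"b{i}"} for i in range(5)]
)
candidates = stratified_sample(traffic, total=20)
```

The sample is only a list of candidates. Each one still needs a reference output and an annotation from a reviewer before it belongs in the golden set.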

Maintenance

Add new cases when production surfaces a new failure mode.
Retire cases when the underlying behavior is deprecated.
Review reference outputs every year: what was correct last year may be wrong now.
Audit representativeness: does the set still match the traffic distribution?
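One way to run the representativeness audit is to compare category shares between the golden set and a recent traffic sample. The sketch below assumes both are lists of dicts with a `category` field; the helper names and the 10-point tolerance are illustrative choices, not a standard.

```python
from collections import Counter


def category_shares(records, key="category"):
    """Return each category's fraction of the records (empty dict if no records)."""
    counts = Counter(r[key] for r in records)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {category: n / total for category, n in counts.items()}


def audit_representativeness(golden, traffic, tolerance=0.10):
    """Report categories whose share in the golden set is off by more than `tolerance`.

    Positive gap: over-represented in the golden set.
    Negative gap: under-represented, or missing entirely.
    """
    golden_shares = category_shares(golden)
    traffic_shares = category_shares(traffic)
    drift = {}
    for category in sorted(set(golden_shares) | set(traffic_shares)):
        gap = golden_shares.get(category, 0.0) - traffic_shares.get(category, 0.0)
        if abs(gap) > tolerance:
            drift[category] = round(gap, 3)
    return drift


# Hypothetical data: the golden set has no billing cases, but traffic now does.
golden = [{"category": "refund"}] * 5 + [{"category": "shipping"}] * 5
traffic = (
    [{"category": "refund"}] * 50
    + [{"category": "shipping"}] * 30
    + [{"category": "billing"}] * 20
)
drift = audit_representativeness(golden, traffic)
```

A category that appears in `drift` with a negative gap is a candidate for new cases; a positive gap suggests cases that could be retired or left as they are if the category is high-stakes.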

Code

Golden set as JSONL

{"id": "refund-clear-policy-1", "category": "refund", "input": "...", "reference": "...", "reasoning": "30-day window; eligible.", "created": "2026-04-12", "reviewer": "alice"}
{"id": "refund-edge-shipping-2", "category": "refund", "input": "...", "reference": "...", "reasoning": "Damaged on arrival; full refund regardless of window.", "created": "2026-04-12", "reviewer": "alice"}

Exercise

Build a 30-case golden set for one prompt. Stratify by intent. Annotate each reference output with why it is correct. Then run your current prompt against it.
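For the last step, a small harness might look like the sketch below. It assumes cases shaped like the JSONL records above; `generate` stands in for whatever calls your prompt, `is_correct` stands in for your grading rule, and `set_version` is a tag you assign yourself. All three are hypothetical placeholders.

```python
def run_golden_set(cases, generate, is_correct, set_version):
    """Run every case through `generate` and summarize the results.

    generate(input_text) -> output text from the prompt under test.
    is_correct(output, reference) -> True if the output is acceptable.
    """
    failures = []
    for case in cases:
        output = generate(case["input"])
        if not is_correct(output, case["reference"]):
            failures.append(case["id"])
    return {
        "set_version": set_version,  # record which build of the set was used
        "total": len(cases),
        "passed": len(cases) - len(failures),
        "failures": failures,
    }


# Stub example: a fake "prompt" that upper-cases input, graded by exact match.
demo_cases = [
    {"id": "case-1", "input": "ok", "reference": "OK"},
    {"id": "case-2", "input": "no", "reference": "nope"},
]
report = run_golden_set(
    demo_cases,
    generate=lambda text: text.upper(),
    is_correct=lambda output, reference: output == reference,
    set_version="v1",
)
```

Exact match is only a stand-in here. Because the annotations describe alternative correct outputs, a real grading rule usually has to accept more than one answer.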
