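Annotation Guidelines and Versioning

The annotation rubric is the contract

'Is this output good?' is not a question an annotator can answer. 'Does this output state the correct answer in under 50 words, in formal English, citing at least one source?' is. The difference between those two prompts is the difference between a useful golden dataset and a folder of disagreements.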

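What a good rubric contains

The exact question the annotator answers, expressed as a binary or a short scale.
Worked examples: pass outputs and fail outputs, each with a verdict and the reasoning behind it.
Edge case decisions: how to handle empty outputs, refusals, partial answers, and answers that are correct but off-topic.
Out-of-scope clauses: things that must not influence the rating.
Calibration set: 10-20 cases that every annotator labels first. Agreement on this set is the evidence that the rubric is understood.

Principle: when two annotators disagree, the fault lies with the rubric, not the annotators. Rewrite it until they agree.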
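One way to put a number on calibration-set agreement is raw percent agreement plus Cohen's kappa, which corrects for agreement expected by chance. The sketch below assumes two annotators and categorical labels; the label lists and function names are made up for illustration, and any threshold for "good enough" is a choice for the team to make.

```python
from collections import Counter

def percent_agreement(a, b):
    """Fraction of cases where both annotators gave the same label."""
    if len(a) != len(b) or not a:
        raise ValueError("label lists must be non-empty and the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cohens_kappa(a, b):
    """Cohen's kappa for two annotators: (p_o - p_e) / (1 - p_e)."""
    n = len(a)
    p_o = percent_agreement(a, b)
    count_a, count_b = Counter(a), Counter(b)
    # Chance agreement: probability both pick the same label independently.
    p_e = sum(count_a[label] * count_b[label] for label in set(a) | set(b)) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (p_o - p_e) / (1 - p_e)

# Hypothetical labels for a 10-case calibration set.
annotator_1 = ["pass", "pass", "fail", "pass", "fail", "fail", "pass", "pass", "fail", "pass"]
annotator_2 = ["pass", "pass", "fail", "fail", "fail", "fail", "pass", "pass", "pass", "pass"]

agreement = percent_agreement(annotator_1, annotator_2)
kappa = cohens_kappa(annotator_1, annotator_2)
# Cases where the annotators disagree are the ones to discuss and fix in the rubric.
disagreements = [i for i, (x, y) in enumerate(zip(annotator_1, annotator_2)) if x != y]
print(f"agreement={agreement:.2f} kappa={kappa:.2f} disagreements={disagreements}")
```

The disagreement indices are the useful output: each one points at a case the rubric failed to decide.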

Version the dataset like code

Datasets evolve. Every change (adding a new case, correcting a label, introducing an edge case) should be a commit with a commit message. There are three structural options, all valid:

Plain git: JSONL files in the repo, reviewed through PRs. The simplest option; scales to roughly 10K cases.
DVC / lakeFS: git for large datasets, with diffing tools included.
Platform-managed: Braintrust, Hugging Face Datasets, Argilla. Versioning, a UI, and role-based access.
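For the plain-git option, a case-level diff between two versions of a JSONL file is often easier to review than a raw line diff. The following is a minimal sketch, assuming each line is a JSON object with a unique `id`; the `reference` field and the sample rows are hypothetical, and inline strings stand in for the two file versions to keep the example self-contained.

```python
import json

def load_cases(jsonl_text):
    """Parse JSONL text into a dict keyed by case id (assumes unique ids)."""
    cases = {}
    for line in jsonl_text.splitlines():
        if line.strip():
            case = json.loads(line)
            cases[case["id"]] = case
    return cases

def diff_versions(old, new):
    """Report which case ids were added, removed, or changed."""
    added = sorted(new.keys() - old.keys())
    removed = sorted(old.keys() - new.keys())
    changed = sorted(k for k in old.keys() & new.keys() if old[k] != new[k])
    return {"added": added, "removed": removed, "changed": changed}

# Hypothetical contents of the dataset file at two commits.
old_text = """
{"id": "qa.034", "reference": "A"}
{"id": "qa.043", "reference": "~13.5M"}
{"id": "qa.077", "reference": "A"}
"""
new_text = """
{"id": "qa.034", "reference": "A"}
{"id": "qa.043", "reference": "~14M city, ~37M metro"}
{"id": "qa.108", "reference": "B"}
"""

summary = diff_versions(load_cases(old_text), load_cases(new_text))
print(summary)
```

In a real repo the two texts would come from two revisions of the same file, and the summary could be pasted into the PR description.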

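Track the why, not the what

'Update qa.043 reference' is a useless commit message. 'Update qa.043: original reference was outdated; followed the provenance link and corrected to the 2024 Tokyo population figure' is a useful one. The future you who is debugging a regression will thank the present you for leaving a trail.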

Code

Rubric template (markdown, lives next to the dataset)

# rubric.md

## Question
Does the assistant's answer correctly summarize the article in <= 100 words?

## Pass criteria (all must hold)
- Mentions the central claim of the article
- Mentions the strongest supporting evidence
- Word count <= 100
- No factual claims absent from the article

## Fail examples
- Summary contains a fact not in the article (hallucination)
- Word count exceeds 100
- Summary changes the central claim

## Edge cases
- If the article has multiple central claims, mentioning either passes
- A refusal ("I cannot summarize this") fails
- Stylistic choices (formal vs casual) do NOT influence the rating

## Calibration set
Label cases qa.001 through qa.020 first. Discuss disagreements before continuing.
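The word-count criterion in this rubric is mechanical, so it could be checked automatically before a human looks at the case, leaving annotators to judge the criteria that need reading. This is a minimal sketch that assumes words are whitespace-separated tokens; the rubric itself does not define how to count, so that definition and the function names are assumptions.

```python
def word_count(text):
    """Count whitespace-separated tokens (an assumed definition of 'word')."""
    return len(text.split())

def precheck(summary, max_words=100):
    """Return an automatic fail for over-length summaries, else defer to a human."""
    n = word_count(summary)
    if n > max_words:
        return f"fail: {n} words exceeds {max_words}"
    return "needs human review"

short_summary = "The article argues that rubrics should ask precise questions."
long_summary = " ".join(["word"] * 101)

print(precheck(short_summary))
print(precheck(long_summary))
```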

Dataset version log entry (JSON)

{
  "version": 7,
  "date": "2026-04-15",
  "author": "alice",
  "changes": [
    {"id": "qa.043", "action": "updated_reference", "reason": "Tokyo population revised to 2024 figures", "prev_value": "~13.5M", "new_value": "~14M city, ~37M metro"},
    {"id": "qa.108", "action": "added", "source": "production_log", "date_collected": "2026-04-12"},
    {"id": "qa.077", "action": "removed", "reason": "duplicate of qa.034"}
  ]
}
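A log entry like the one above can be checked automatically so that no change lands without its "why". The sketch below infers its rules from the sample entry (updates and removals carry a `reason`, additions carry a `source`); those rules and the `REQUIRED_BY_ACTION` table are assumptions, not a fixed schema.

```python
# Assumed rules, inferred from the sample entry: which fields each action must carry.
REQUIRED_BY_ACTION = {
    "updated_reference": ["reason", "prev_value", "new_value"],
    "added": ["source"],
    "removed": ["reason"],
}

def validate_log_entry(entry):
    """Return a list of human-readable problems; an empty list means the entry passes."""
    problems = []
    for key in ("version", "date", "author", "changes"):
        if key not in entry:
            problems.append(f"missing top-level field: {key}")
    for i, change in enumerate(entry.get("changes", [])):
        case_id = change.get("id", f"<change {i}>")
        action = change.get("action")
        if action not in REQUIRED_BY_ACTION:
            problems.append(f"{case_id}: unknown action {action!r}")
            continue
        for field in REQUIRED_BY_ACTION[action]:
            # Treat missing and blank values the same way.
            if not str(change.get(field, "")).strip():
                problems.append(f"{case_id}: action {action!r} needs a non-empty {field!r}")
    return problems

good_entry = {
    "version": 7,
    "date": "2026-04-15",
    "author": "alice",
    "changes": [
        {"id": "qa.043", "action": "updated_reference",
         "reason": "Tokyo population revised to 2024 figures",
         "prev_value": "~13.5M", "new_value": "~14M city, ~37M metro"},
        {"id": "qa.108", "action": "added", "source": "production_log",
         "date_collected": "2026-04-12"},
        {"id": "qa.077", "action": "removed", "reason": "duplicate of qa.034"},
    ],
}
# Same version number, but the removal has lost its reason.
bad_entry = {**good_entry, "changes": [{"id": "qa.077", "action": "removed"}]}

print(validate_log_entry(good_entry))
print(validate_log_entry(bad_entry))
```

Run as a pre-commit or CI check, a validator like this turns "track the why" from a convention into something a PR cannot skip.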
