What Is a Test Dataset?

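The Dataset Is the Ceiling

An eval suite is only as good as the dataset it runs on. The grader can be perfect, the framework best in class, and the judge GPT-5, but if the inputs do not represent the real workload, none of it means anything.

A test dataset is a deliberately curated collection of inputs, optionally paired with references (what good looks like) and metadata tags (so the suite can be sliced later). The structure looks small. The real work is the discipline of building it.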

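Three Datasets, Three Roles

Smoke set: 5-20 cases that exercise the happy path. Runs in seconds. Catches "I broke the pipeline" bugs before you push.
Regression set: 100-500 cases covering the behaviors you have committed to protecting. Runs on every PR. Catches silent quality regressions.
Adversarial / edge-case set: hard, weird, malicious, or low-frequency cases. Run less often. Catches the failures that would otherwise surface only when a user runs into them.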
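
The three roles above can be written down as a small mapping from a trigger to the sets it runs. This is only a sketch: the trigger names (`pre_push`, `pull_request`, `nightly`), the file paths other than `datasets/regression.jsonl`, and the `sets_for` helper are hypothetical. Running the smaller sets again on the slower triggers, and running the adversarial set nightly, are assumptions; the text only says it runs less often.

```python
# Hypothetical mapping from a trigger to the datasets it should run.
# Trigger names and paths are illustrative, not part of any framework.
SUITES = {
    "smoke": {
        "path": "datasets/smoke.jsonl",
        "triggers": {"pre_push", "pull_request", "nightly"},
    },
    "regression": {
        "path": "datasets/regression.jsonl",
        "triggers": {"pull_request", "nightly"},
    },
    "adversarial": {
        "path": "datasets/adversarial.jsonl",
        "triggers": {"nightly"},  # assumed cadence for the "run less often" set
    },
}

def sets_for(trigger):
    """Return the names of the datasets to run for a trigger, in declaration order."""
    return [name for name, cfg in SUITES.items() if trigger in cfg["triggers"]]

print(sets_for("pre_push"))
print(sets_for("pull_request"))
print(sets_for("nightly"))
```

Keeping the mapping in one place makes it easy to see which changes are protected by which set.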

Principle: coverage matters more than size. A 200-case dataset that covers 20 distinct behaviors beats a 5,000-case dataset that mostly tests a single happy path.
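
One way to make this principle measurable is to count how many distinct behaviors a dataset touches and how much of it the single most common behavior takes up. The sketch below assumes each case carries a hypothetical `behavior` field (the dataset format in this lesson only has `tags`), and `coverage_report` is an illustrative helper. The two toy datasets are generated to mirror the 200-case and 5,000-case comparison and are not real data.

```python
from collections import Counter

def coverage_report(cases):
    """Summarize how many distinct behaviors a non-empty dataset covers and how skewed it is."""
    counts = Counter(case["behavior"] for case in cases)
    top_behavior, top_count = counts.most_common(1)[0]
    return {
        "cases": len(cases),
        "distinct_behaviors": len(counts),
        "top_behavior": top_behavior,
        "top_share": top_count / len(cases),
    }

# Toy data: a small, varied set versus a large set dominated by one happy path.
small_varied = [{"behavior": f"behavior_{i % 20}"} for i in range(200)]
large_skewed = (
    [{"behavior": "happy_path"}] * 4900
    + [{"behavior": f"behavior_{i % 4}"} for i in range(100)]
)

print(coverage_report(small_varied))
print(coverage_report(large_skewed))
```

A high `top_share` together with a low `distinct_behaviors` count is a hint that the dataset is large but narrow.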

Where the Data Comes From

Production logs: samples of real user traffic (with privacy controls). The most representative source.
Domain experts: they know failure modes you have not run into yet.
Synthetic generation: an LLM fabricates inputs at scale. Useful for adversarial sets and rare combinations.
Public benchmarks: useful for foundational capability checks, never sufficient for product-specific evals.
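
For the production-log source, a first pass often looks like the sketch below: redact obvious identifiers, drop duplicate inputs, and tag each case with where it came from. Everything here is an assumption made for illustration: the `user_input` field name, the `cases_from_logs` helper, the `source:production` tag, and the sample log lines. A single email regex is nowhere near a complete privacy control; it only marks the place where real redaction would go.

```python
import re

# Deliberately simple email pattern; real privacy controls need far more than this.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")

def cases_from_logs(log_records, prefix="prod"):
    """Turn raw log records into dataset cases: redact emails, drop duplicates, tag the source."""
    cases, seen = [], set()
    for record in log_records:
        text = EMAIL.sub("<EMAIL>", record["user_input"]).strip()
        key = text.lower()  # case-insensitive duplicate check
        if not text or key in seen:
            continue
        seen.add(key)
        cases.append({
            "id": f"{prefix}.{len(cases) + 1:03d}",
            "input": text,
            "tags": ["source:production"],
        })
    return cases

# Made-up log lines, for illustration only.
logs = [
    {"user_input": "Reset the password for jane.doe@example.com"},
    {"user_input": "reset the password for JANE.DOE@example.com "},
    {"user_input": "What is the refund policy?"},
]
for case in cases_from_logs(logs):
    print(case)
```

References are left out on purpose: production inputs usually arrive without a known good answer, so those still have to be written or reviewed by someone who knows the domain.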

Code

JSONL dataset format: the canonical eval shape (json)

{"id": "qa.001", "input": "What is the capital of France?", "reference": "Paris", "tags": ["qa", "easy", "english"]}
{"id": "qa.002", "input": "Who wrote Hamlet?", "reference": "William Shakespeare", "tags": ["qa", "easy", "english"]}
{"id": "qa.003", "input": "Capital of South Sudan?", "reference": "Juba", "tags": ["qa", "medium", "geography"]}
{"id": "qa.004", "input": "En quelle année est mort Napoléon?", "reference": "1821", "tags": ["qa", "french", "date"]}
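
Since every later step trusts this shape, it can help to check a file before running against it. The following is a minimal sketch, assuming `id`, `input`, and `tags` are required and `reference` is optional, as described earlier. `validate_cases` is a hypothetical helper, not part of any eval framework, and the sample lines are deliberately broken variants of the records above.

```python
import json

REQUIRED = ("id", "input", "tags")  # "reference" is treated as optional

def validate_cases(lines):
    """Parse JSONL lines and return (cases, errors); errors are human-readable strings."""
    cases, errors, seen_ids = [], [], set()
    for lineno, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # tolerate blank lines
        try:
            case = json.loads(line)
        except json.JSONDecodeError as exc:
            errors.append(f"line {lineno}: invalid JSON ({exc.msg})")
            continue
        if not isinstance(case, dict):
            errors.append(f"line {lineno}: expected a JSON object")
            continue
        missing = [key for key in REQUIRED if key not in case]
        if missing:
            errors.append(f"line {lineno}: missing {', '.join(missing)}")
            continue
        if case["id"] in seen_ids:
            errors.append(f"line {lineno}: duplicate id {case['id']}")
            continue
        seen_ids.add(case["id"])
        cases.append(case)
    return cases, errors

sample = [
    '{"id": "qa.001", "input": "What is the capital of France?", "reference": "Paris", "tags": ["qa"]}',
    '{"id": "qa.001", "input": "Who wrote Hamlet?", "reference": "William Shakespeare", "tags": ["qa"]}',
    '{"id": "qa.003", "input": "Capital of South Sudan?"}',
]
cases, errors = validate_cases(sample)
print(len(cases), "valid case(s)")
for error in errors:
    print(error)
```

Collecting errors instead of raising on the first one makes it easier to fix a whole file in one pass.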

Load and stratify for a run (python)

import json
from collections import Counter

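def load_jsonl(path):
    # One JSON object per line; skip blank lines so a trailing newline does not break the load.
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]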

dataset = load_jsonl("datasets/regression.jsonl")
print(f"loaded {len(dataset)} cases")
print("tag distribution:", Counter(t for case in dataset for t in case["tags"]).most_common())

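# Stratify: take up to n_per_tag cases per tag rather than picking the first 50 overall
def stratified_sample(dataset, n_per_tag=10):
    # Group cases by tag; a case with several tags lands in several groups.
    by_tag = {}
    for case in dataset:
        for tag in case["tags"]:
            by_tag.setdefault(tag, []).append(case)
    out = []
    seen = set()  # ids already selected, so a multi-tag case is returned only once
    for tag, cases in by_tag.items():
        # Takes the first n_per_tag cases in file order; shuffle beforehand for a random draw.
        for c in cases[:n_per_tag]:
            if c["id"] not in seen:
                out.append(c)
                seen.add(c["id"])
    return out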
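
The tags are what make it possible to slice results after a run. As a sketch, assuming each result is a hypothetical `(case, passed)` pair produced by whatever grader is in use, a per-tag pass rate can be computed as below. `pass_rate_by_tag` is an illustrative helper, and the pass/fail values are made up to show the output shape.

```python
from collections import defaultdict

def pass_rate_by_tag(results):
    """results: iterable of (case, passed) pairs. Returns {tag: (passed, total, rate)}."""
    totals = defaultdict(lambda: [0, 0])  # tag -> [passed, total]
    for case, passed in results:
        for tag in case["tags"]:
            totals[tag][0] += int(passed)
            totals[tag][1] += 1
    return {tag: (p, t, p / t) for tag, (p, t) in totals.items()}

# Made-up outcomes for the four sample cases, for illustration only.
results = [
    ({"id": "qa.001", "tags": ["qa", "easy", "english"]}, True),
    ({"id": "qa.002", "tags": ["qa", "easy", "english"]}, True),
    ({"id": "qa.003", "tags": ["qa", "medium", "geography"]}, False),
    ({"id": "qa.004", "tags": ["qa", "french", "date"]}, True),
]
for tag, (passed, total, rate) in sorted(pass_rate_by_tag(results).items()):
    print(f"{tag:10s} {passed}/{total} ({rate:.0%})")
```

A single overall score would hide that the only failure sits in the `medium` and `geography` slices; the per-tag view surfaces it.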
