Data Cleaning, Validation, and Splits

Cleaning checklist

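Deduplication: remove exact duplicates and near-duplicates. Near-dupe detection (MinHash, embedding cosine similarity) catches what string equality misses.
Format consistency: every example follows the same schema. Mix two formats and the model learns both badly.
Length filter: drop examples that are too short (trivial) or too long (noise / context overflow).
Quality filter: drop examples with grammar errors, hallucinated content, broken JSON, or unmatched code fences.
Tokenization check: count tokens with the actual tokenizer, not a character heuristic. Verify that every example fits in the training context window.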
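
Two of the checklist items can be sketched in a few lines. This is a minimal illustration, not a production pipeline: it uses plain Jaccard similarity over word 3-grams as a simpler stand-in for MinHash (which approximates the same similarity at scale), and the helper names (`shingles`, `jaccard`, `drop_near_duplicates`, `fits_context`), the 0.8 threshold, and the `encode` parameter are all assumptions made for the example. In practice `encode` would be the encode function of the tokenizer for the model being trained.

```python
import re
from typing import Callable, Sequence


def shingles(text: str, k: int = 3) -> set:
    """Lowercased word k-grams, so case and punctuation differences vanish."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: set, b: set) -> float:
    """Overlap of two shingle sets, from 0.0 (disjoint) to 1.0 (identical)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def drop_near_duplicates(texts: list, threshold: float = 0.8) -> list:
    """Keep a text only if it is not too similar to any text already kept.

    Pairwise comparison is O(n^2); fine for a sketch, too slow for large sets.
    """
    kept, kept_shingles = [], []
    for text in texts:
        s = shingles(text)
        if any(jaccard(s, other) >= threshold for other in kept_shingles):
            continue
        kept.append(text)
        kept_shingles.append(s)
    return kept


def fits_context(text: str, encode: Callable[[str], Sequence],
                 max_tokens: int) -> bool:
    """True if the text tokenizes to at most max_tokens under `encode`."""
    return len(encode(text)) <= max_tokens
```

The string-equality check misses pairs like "How do I reset my password?" and "how do i reset my password", while the shingle comparison treats them as the same example. Passing the tokenizer in as a parameter keeps the length check honest: the count comes from whatever tokenizer is supplied, never from a characters-per-token guess.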

3-way split

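Split	Ratio	Purpose
Train	~80%	What the model actually learns from.
Validation	~10%	Monitored during training. Detects overfitting; informs early stopping.
Test	~10%	Held out completely until training is finished. Gives the final unbiased number.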

Why validation is the canary

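If training loss keeps falling while validation loss starts to rise, the model is memorizing rows rather than learning the behavior. That divergence is the overfitting signal: it marks the moment to stop training, not the point where "the loss looks low."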
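
The divergence rule can be turned into a simple patience-based check. The sketch below is one common way to do it, not the only one: the function name `find_early_stop` and the `patience` and `min_delta` parameters are assumptions for illustration, and the loss values are made up.

```python
from typing import Optional


def find_early_stop(val_losses: list, patience: int = 2,
                    min_delta: float = 0.0) -> tuple:
    """Return (stop_epoch, best_epoch) for a sequence of validation losses.

    stop_epoch is the first epoch at which validation loss has failed to
    improve for `patience` consecutive epochs, or None if that never happens.
    best_epoch is the last epoch that beat the previous best by more than
    min_delta, i.e. the checkpoint worth keeping.
    """
    best_loss = float("inf")
    best_epoch: Optional[int] = None
    stale = 0  # consecutive epochs without improvement
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            best_loss, best_epoch, stale = loss, epoch, 0
        else:
            stale += 1
            if stale >= patience:
                return epoch, best_epoch
    return None, best_epoch


# Validation loss bottoms out at epoch 2, then climbs.
print(find_early_stop([1.0, 0.8, 0.7, 0.72, 0.75, 0.9]))  # (4, 2)
```

Training loss does not appear in the function at all, which is the point: the stopping decision rests on the held-out validation numbers, and the checkpoint to keep is the one from `best_epoch`, not the last one trained.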

Code

Defensive validator + 80/10/10 split for chat-format JSONL·python

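import hashlib
import json
import random


def validate(filepath: str) -> tuple[list, list]:
    """Return (valid examples, error messages) for a chat-format JSONL file."""
    examples, errors, seen = [], [], set()
    with open(filepath, encoding="utf-8") as f:
        for i, line in enumerate(f, 1):
            if not line.strip():
                continue  # skip blank lines instead of reporting them as bad JSON
            try:
                ex = json.loads(line)
            except json.JSONDecodeError as e:
                errors.append(f"line {i}: invalid JSON ({e})")
                continue
            msgs = ex.get("messages") if isinstance(ex, dict) else None
            if not isinstance(msgs, list) or not msgs:
                errors.append(f"line {i}: missing 'messages'")
                continue
            if not all(isinstance(m, dict) for m in msgs):
                errors.append(f"line {i}: every message must be an object")
                continue
            if msgs[-1].get("role") != "assistant":
                errors.append(f"line {i}: last message must be assistant")
                continue
            # Report every empty message, then reject the whole example.
            empty = [j for j, m in enumerate(msgs)
                     if not str(m.get("content") or "").strip()]
            if empty:
                errors.extend(f"line {i} msg {j}: empty content" for j in empty)
                continue
            # Hash canonical JSON so key order and whitespace can't hide duplicates.
            canon = json.dumps(ex, sort_keys=True)
            h = hashlib.sha1(canon.encode("utf-8")).hexdigest()
            if h in seen:
                errors.append(f"line {i}: exact duplicate")
                continue
            seen.add(h)
            examples.append(ex)
    return examples, errors


examples, errs = validate("clean.jsonl")
print(f"kept: {len(examples)}, errors: {len(errs)}")

random.Random(42).shuffle(examples)  # fixed seed keeps the split reproducible
n = len(examples)
train_end, val_end = int(n * 0.8), int(n * 0.9)
splits = {
    "train": examples[:train_end],
    "val": examples[train_end:val_end],
    "test": examples[val_end:],
}
for name, rows in splits.items():
    with open(f"{name}.jsonl", "w", encoding="utf-8") as f:
        for r in rows:
            f.write(json.dumps(r) + "\n")
    print(f"{name}: {len(rows)}")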
