Better data beats a fancier model

~28 min · data-centric, labeling, framing

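Why data wins

In most production problems the bottleneck isn't the model architecture. It's label quality, label coverage, feature freshness, and the gap between training data and production data. A team that spends a week on labels and a day on the model usually beats a team that does the reverse.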

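Five high-leverage data moves

Write a labeling guide that turns judgment calls into reproducible rules.
Sample the items annotators disagree on and re-label them as a team.
Add an explicit "unsure" / "ambiguous" label so noise has somewhere to live.
Track label quality as closely as you track model quality.
Before modeling, look at rare classes and edge cases yourself.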
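
The second and fourth moves can be sketched in a few lines. This is a minimal, standard-library-only illustration, not a prescribed workflow: the function names `sample_disagreements` and `disagreement_pairs`, the toy labels, and the `"unsure"` label value are all assumptions made for the example.

```python
import random
from collections import Counter

def sample_disagreements(labels_a, labels_b, k=50, seed=0):
    """Return up to k indices (sorted) where the two annotators disagree."""
    disagree = [i for i, (a, b) in enumerate(zip(labels_a, labels_b)) if a != b]
    rng = random.Random(seed)   # fixed seed so the review queue is reproducible
    rng.shuffle(disagree)
    return sorted(disagree[:k])

def disagreement_pairs(labels_a, labels_b):
    """Count how often each (label_a, label_b) pair is confused."""
    return Counter((a, b) for a, b in zip(labels_a, labels_b) if a != b)

# Hypothetical toy labels; "unsure" is the explicit escape-hatch label.
labels_a = ["spam", "ham", "spam", "unsure", "ham", "spam"]
labels_b = ["spam", "spam", "spam", "ham", "ham", "unsure"]

review_queue = sample_disagreements(labels_a, labels_b, k=2)
pair_counts = disagreement_pairs(labels_a, labels_b)
disagreement_rate = sum(pair_counts.values()) / len(labels_a)

print("re-label as a team:", review_queue)
print("disagreement rate:", disagreement_rate)
print("most confused pairs:", pair_counts.most_common(3))
```

The pair counts are the useful part for the labeling guide: a pair that keeps showing up usually points at a rule the guide doesn't cover yet. Logging the disagreement rate per labeling batch is one simple way to track label quality over time.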

Andrew Ng's data-centric flip

The data-centric AI message is that improving the labels while holding the model code fixed often beats tuning the model. The implication: label review is engineering work, not assistant work, and it deserves engineering time.

Code

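Tracking inter-annotator agreement · python

from sklearn.metrics import cohen_kappa_score

# annotator_a, annotator_b: equal-length label sequences for the same items, in the same order
kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"agreement κ={kappa:.2f}  (>0.6 acceptable, >0.8 good)")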

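Sampling low-confidence predictions as re-label candidates · python

import numpy as np

# Binary classifier assumed: probs[:, 1] is P(positive class).
probs = model.predict_proba(X_unlabeled)
# 1.0 when P = 0.5 (most uncertain), 0.0 when P is 0 or 1.
uncertainty = 1 - np.abs(probs[:, 1] - 0.5) * 2
# X_unlabeled is assumed to be a pandas DataFrame; take the 200 most uncertain rows.
to_review = X_unlabeled.iloc[np.argsort(-uncertainty)[:200]]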
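
The snippet above only looks at `probs[:, 1]`, so it fits binary classifiers. For more than two classes, one common substitute (swapped in here, not part of the original snippet) is normalized Shannon entropy over the predicted probabilities. A minimal sketch, where `entropy_uncertainty` and the toy `probs` array are hypothetical:

```python
import numpy as np

def entropy_uncertainty(probs):
    """Normalized Shannon entropy per row: near 0 = confident, near 1 = uniform."""
    probs = np.clip(np.asarray(probs, dtype=float), 1e-12, 1.0)  # avoid log(0)
    entropy = -(probs * np.log(probs)).sum(axis=1)
    return entropy / np.log(probs.shape[1])  # divide by max entropy, log(n_classes)

# Hypothetical predicted probabilities: 4 examples, 3 classes.
probs = np.array([
    [0.98, 0.01, 0.01],   # confident
    [0.34, 0.33, 0.33],   # nearly uniform
    [0.60, 0.30, 0.10],
    [0.50, 0.50, 0.00],   # torn between two classes
])

uncertainty = entropy_uncertainty(probs)
review_order = np.argsort(-uncertainty)  # most uncertain first
print(review_order)
```

In the two-class case this gives the same ranking as the distance-from-0.5 score, since both are monotone in how far the probability sits from 0.5.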

Exercise

Pull 100 labeled examples from your project. Have a second reviewer re-label them blind. Compute Cohen's kappa. If it's below 0.7, fix the labeling guide before you train the next model.
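
To see what the kappa number in this exercise actually measures, here is a hand-rolled sketch of the formula κ = (p_o − p_e) / (1 − p_e), where p_o is the observed agreement and p_e is the agreement expected by chance from each reviewer's label frequencies. The function name, the ten toy labels, and the `fix_guide_first` flag are assumptions made for illustration; in practice `cohen_kappa_score` from scikit-learn does the same calculation.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length label sequences."""
    n = len(labels_a)
    if n == 0 or n != len(labels_b):
        raise ValueError("need two non-empty sequences of equal length")
    # p_o: fraction of items where both reviewers chose the same label
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # p_e: chance agreement from each reviewer's marginal label frequencies
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return float("nan")  # both used one identical label; kappa is undefined
    return (observed - expected) / (1 - expected)

# Hypothetical toy data: 10 items instead of the 100 the exercise asks for.
original = ["pos"] * 6 + ["neg"] * 4
second   = ["pos"] * 5 + ["neg"] + ["neg"] * 3 + ["pos"]

kappa = cohen_kappa(original, second)
fix_guide_first = kappa < 0.7   # threshold taken from the exercise
print(f"kappa={kappa:.2f}, fix labeling guide first: {fix_guide_first}")
```

Here the reviewers agree on 8 of 10 items, which sounds fine, but chance alone would give 0.52 agreement, so kappa lands well under the 0.7 bar.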
