ML Task Family

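The task name determines the formulas

The first decision in an ML project is naming the task family. The family fixes the target shape, the metric, the split strategy, and the baseline. This isn't taxonomy trivia; it's the contract that decides whether your evaluation is honest.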
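
One way to make that contract concrete is to write it down as data. The sketch below is only an illustration: the `TaskContract` class, the `contract_for` helper, and the specific metric, split, and baseline choices are assumptions made for this example, not fixed rules, and a real project would pick its own.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskContract:
    """The four things a task family pins down (hypothetical structure)."""
    target_shape: str
    metric: str
    split: str
    baseline: str


# Illustrative defaults only; these are assumed choices, not prescriptions.
CONTRACTS = {
    "regression": TaskContract(
        target_shape="one float per row",
        metric="mean absolute error",
        split="random holdout",
        baseline="predict the training median",
    ),
    "binary_classification": TaskContract(
        target_shape="one 0/1 label per row",
        metric="ROC AUC",
        split="stratified holdout",
        baseline="predict the base rate",
    ),
    "forecasting": TaskContract(
        target_shape="one float per future time step",
        metric="mean absolute error over the horizon",
        split="time-ordered split",
        baseline="repeat the last observed value",
    ),
}


def contract_for(family: str) -> TaskContract:
    """Look up the contract, failing loudly for a family nobody has defined."""
    if family not in CONTRACTS:
        raise ValueError(f"no contract defined for task family: {family!r}")
    return CONTRACTS[family]
```

Calling `contract_for("forecasting").split` returns `"time-ordered split"`, which is the kind of decision that gets silently skipped when the family is never named.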

The seven families you'll see most often

Regression: a number on a continuous scale (price, demand, latency).
Binary classification: one of two outcomes, usually with a probability (churn, fraud, click).
Multi-class classification: exactly one of K classes (digit, intent, category).
Multi-label classification: any subset of K labels (article tags).
Ranking: the order in which to show items to a user (search, recommendation).
Forecasting: predicting the future from the past (sales, capacity).
Anomaly detection: rare or unusual patterns (fraud, machine failure).
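
The families above differ most visibly in what the target looks like. Here is a minimal sketch with four rows and K = 3; all values are made-up toy data, and the variable names are just for this example.

```python
n_classes = 3  # K in the list above

# Regression: one real number per row.
y_regression = [310.5, 129.0, 87.25, 402.0]

# Binary classification: one 0/1 label per row.
y_binary = [0, 1, 0, 0]

# Multi-class: exactly one class index in [0, K) per row.
y_multiclass = [2, 0, 1, 2]

# Multi-label: one indicator row of length K per example;
# a row may switch on zero, one, or several labels.
y_multilabel = [
    [1, 0, 1],
    [0, 0, 0],
    [1, 1, 1],
    [0, 1, 0],
]

# Ranking: relevance grades grouped by query; only the order
# within each query group matters.
query_ids = [0, 0, 1, 1]
relevance = [3, 0, 1, 2]

# Forecasting: values indexed by time (position = week number here),
# so row order carries meaning and must not be shuffled.
y_series = [120.0, 135.0, 128.0, 150.0]

# Anomaly detection often has few or no labels at training time,
# so there may be no target column at all.

# Number of active labels per multi-label row.
labels_per_row = [sum(row) for row in y_multilabel]
```

Note that `y_multiclass` and `y_multilabel` describe the same K classes but are not interchangeable: the first allows exactly one class per row, the second allows any subset.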

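The trap of picking the wrong family

The most common framing mistake is calling everything classification just because the final UI shows a yes/no. Often the real task is ranking (top-10 leads), forecasting (next week's capacity), or scoring (a 0-to-1 risk). Pick the wrong family and the metric gets quietly contaminated: it still produces a number, but that number no longer measures what the product needs.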

Code

Mapping one team's backlog to task families · python

task_map = {
    "house_price_estimate": "regression",
    "will_user_churn": "binary_classification",
    "support_ticket_topic": "multi_class_classification",
    "article_tags": "multi_label_classification",
    "search_results": "ranking",
    "weekly_demand": "forecasting",
    "machine_failure_alert": "anomaly_detection",
}

Same data, different family, different metric · python

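import numpy as np
from sklearn.metrics import roc_auc_score

# Assumed inputs: y_converted holds 0/1 conversion labels and p_convert holds
# predicted conversion probabilities, aligned row by row.
y_true = np.asarray(y_converted)
scores = np.asarray(p_convert)

# Treat as classification: how well do the scores separate converters overall?
auc = roc_auc_score(y_true, scores)

# Treat as ranking: of the 50 highest-scored leads, what fraction converted?
top_k_idx = np.argsort(scores)[::-1][:50]
precision_at_50 = y_true[top_k_idx].mean()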
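
To see that the two framings can actually disagree, here is a small self-contained sketch in plain Python. The labels and the two score lists are invented toy values, and `roc_auc` and `precision_at_k` are hand-written helpers for this example (the AUC is computed as the fraction of positive/negative pairs ordered correctly, with ties counted as half).

```python
def roc_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def precision_at_k(y_true, scores, k):
    """Fraction of positives among the k highest-scored rows."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sum(y_true[i] for i in order[:k]) / k


# Toy data: 4 leads converted, 6 did not.
y = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

# Model A: nearly perfect order, but one non-converter sits at the very top.
scores_a = [0.90, 0.80, 0.70, 0.60, 0.95, 0.50, 0.40, 0.30, 0.20, 0.10]

# Model B: a clean top 3, but one converter is ranked dead last.
scores_b = [0.90, 0.80, 0.70, 0.05, 0.60, 0.50, 0.40, 0.30, 0.20, 0.10]

auc_a, auc_b = roc_auc(y, scores_a), roc_auc(y, scores_b)
p3_a, p3_b = precision_at_k(y, scores_a, 3), precision_at_k(y, scores_b, 3)

print(f"A: AUC={auc_a:.3f}  precision@3={p3_a:.3f}")
print(f"B: AUC={auc_b:.3f}  precision@3={p3_b:.3f}")
```

On this toy data model A wins on AUC (20 of 24 pairs ordered correctly versus 18 of 24) while model B wins on precision@3 (3 of 3 versus 2 of 3). If sales only ever calls the top few leads, the classification framing would pick the wrong model.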
