Handling class imbalance

Why imbalance breaks accuracy

When one class dominates (say 95% negative, 5% positive), a model can hit 95% accuracy by doing something useless: always answer 'negative'. The rare class contributes so few examples to the gradient that the loss barely punishes mistakes on it. This is the classic trap in fraud, disease, and defect detection, and it's why accuracy lies on imbalanced data. Look at precision and recall on the minority class to see what is actually happening.
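To make the 95% trap concrete, here is a minimal sketch using made-up toy labels (the 95/5 split mirrors the example above; the array names are just for illustration). It scores a "model" that always answers negative.

```python
import numpy as np

# Hypothetical toy labels: 95 negatives (0) and 5 positives (1).
y_true = np.array([0] * 95 + [1] * 5)

# A "model" that ignores its input and always answers negative.
y_pred = np.zeros_like(y_true)

# Fraction of labels matched: looks great on paper.
accuracy = float(np.mean(y_pred == y_true))

# Recall on the minority class: of the real positives, how many were caught?
true_pos = int(np.sum((y_pred == 1) & (y_true == 1)))
false_neg = int(np.sum((y_pred == 0) & (y_true == 1)))
recall = true_pos / (true_pos + false_neg)

print(f"accuracy={accuracy:.2f} recall={recall:.2f}")
```

Accuracy comes out at 0.95 while minority recall is 0.00. Precision is not even defined here, because the model never predicts positive.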

Three levers, in order of cost

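Start with the cheapest. (1) class_weight: a dict passed to fit() that scales each class's contribution to the loss, so mistakes on the rare class hurt more. Free, one line, try it first. (2) sample_weight: a per-example weight array, for when you need finer control than per-class. (3) resampling: oversample the minority (by duplication, or by synthesizing with SMOTE) or undersample the majority, done in the data pipeline before fit() sees the data.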
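If you'd rather not pick the class_weight numbers by hand, one common starting point is inverse class frequency. The sketch below assumes integer labels; `balanced_class_weight` is a hypothetical helper name, not a library function. It uses n_samples / (n_classes * class_count), which is the heuristic scikit-learn documents for its "balanced" mode. Treat the result as a first guess, not a tuned value.

```python
import numpy as np

def balanced_class_weight(y):
    """Hypothetical helper: weight = n_samples / (n_classes * class_count)."""
    classes, counts = np.unique(y, return_counts=True)
    weights = len(y) / (len(classes) * counts)
    return {int(c): float(w) for c, w in zip(classes, weights)}

# Toy labels matching the 95% / 5% example.
y = np.array([0] * 95 + [1] * 5)

# Dict form, suitable for a class_weight argument.
class_weight = balanced_class_weight(y)

# The same information expanded to one weight per example (sample_weight form).
sample_weight = np.array([class_weight[int(label)] for label in y])

print(class_weight)
```

With this split the rare class gets weight 10.0 and the majority roughly 0.53, so both classes end up contributing the same total weight to the loss.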

One more lever at inference time

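Even after weighting, you can still tune the *decision threshold*. A binary classifier calls anything above 0.5 'positive' by default; on imbalanced data, lowering that to, say, 0.3 trades precision for recall, which is useful when missing a positive (a missed disease) costs far more than a false alarm. Threshold tuning is post-hoc and free (no retraining), so it pairs well with the training-time levers above. One lever alone often isn't enough; combining two or more is common.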
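The trade described above can be seen on a handful of scores. This is a sketch on invented labels and probabilities (both arrays are made up for illustration, and `precision_recall` is a hypothetical helper). In practice `y_prob` would come from your model's predictions on a validation set.

```python
import numpy as np

def precision_recall(y_true, y_prob, threshold):
    """Hypothetical helper: precision and recall at a given decision threshold."""
    y_pred = (y_prob >= threshold).astype(int)
    tp = int(np.sum((y_pred == 1) & (y_true == 1)))
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Made-up labels and predicted positive-class probabilities.
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_prob = np.array([0.05, 0.10, 0.20, 0.35, 0.40, 0.60, 0.32, 0.45, 0.70, 0.90])

for threshold in (0.5, 0.3):
    precision, recall = precision_recall(y_true, y_prob, threshold)
    print(f"threshold={threshold}: precision={precision:.2f} recall={recall:.2f}")
```

On this toy data, dropping the threshold from 0.5 to 0.3 catches every positive but also lets in more false alarms. Pick the threshold on held-out data, not on the test set.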

Code

Three weighting / resampling approaches for imbalance · python

import numpy as np

# 1. class_weight in model.fit(): mistakes on class 1 cost 10x more
model.fit(x, y, class_weight={0: 1.0, 1: 10.0})

# 2. sample_weight array (per-sample importance), built from the same labels y
weights = np.where(y == 1, 10.0, 1.0)
model.fit(x, y, sample_weight=weights)

# 3. Oversample minority class in the pipeline
# (duplicate minority samples or use SMOTE)
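Option 3 is only a comment above, so here is a minimal sketch of the simplest variant: random oversampling by duplication, in plain NumPy. `oversample_minority` is a hypothetical helper, and the toy `x` / `y` arrays are invented. It does not implement SMOTE; that needs a dedicated library.

```python
import numpy as np

def oversample_minority(x, y, minority_label=1, seed=0):
    """Hypothetical helper: duplicate minority rows until classes are balanced."""
    rng = np.random.default_rng(seed)
    minority_idx = np.flatnonzero(y == minority_label)
    majority_idx = np.flatnonzero(y != minority_label)
    n_extra = len(majority_idx) - len(minority_idx)
    if len(minority_idx) == 0 or n_extra <= 0:
        return x, y  # nothing to duplicate, or already balanced
    # Draw extra minority rows with replacement, then shuffle everything.
    extra_idx = rng.choice(minority_idx, size=n_extra, replace=True)
    all_idx = rng.permutation(
        np.concatenate([majority_idx, minority_idx, extra_idx])
    )
    return x[all_idx], y[all_idx]

# Toy data: 100 rows with 2 features, 95 negatives and 5 positives.
x = np.arange(200, dtype=float).reshape(100, 2)
y = np.array([0] * 95 + [1] * 5)

x_bal, y_bal = oversample_minority(x, y)
print(np.bincount(y_bal))
```

Apply this to the training split only. Duplicating rows before splitting leaks copies of the same example into validation and inflates the metrics.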

