Random Forest

The bagging recipe

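A random forest grows many decision trees on bootstrap samples of the training data, and each tree considers only a random subset of features at every split. The prediction is the average of the trees (regression) or their vote (classification). Bagging reduces variance dramatically while keeping bias close to that of a single tree. Random feature subsetting decorrelates the trees, and that decorrelation is what makes the variance reduction real: averaging highly correlated trees would cancel little of their error.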
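The recipe above can be sketched by hand with plain decision trees. This is a minimal illustration, not how any library implements it internally: the synthetic dataset, the tree count of 50, and all variable names are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary data, purely for illustration (an assumption of this sketch).
X, y = make_classification(n_samples=600, n_features=20, n_informative=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

rng = np.random.default_rng(0)
n_trees = 50
trees = []
for _ in range(n_trees):
    # Bootstrap: draw as many rows as the training set has, with replacement.
    idx = rng.integers(0, len(X_train), size=len(X_train))
    # max_features="sqrt" makes every split look at a random feature subset.
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=int(rng.integers(1_000_000)))
    tree.fit(X_train[idx], y_train[idx])
    trees.append(tree)

# Majority vote: labels are 0/1, so the mean of the votes is the share voting 1.
votes = np.stack([t.predict(X_test) for t in trees])  # shape (n_trees, n_test)
ensemble_pred = (votes.mean(axis=0) >= 0.5).astype(int)

single_acc = np.mean(trees[0].predict(X_test) == y_test)
ensemble_acc = np.mean(ensemble_pred == y_test)
print(f"single tree accuracy: {single_acc:.3f}")
print(f"bagged vote accuracy: {ensemble_acc:.3f}")
```

The hard vote here is the simplest form of aggregation. scikit-learn's RandomForestClassifier averages the trees' predicted class probabilities instead, so its output can differ slightly from this sketch.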

What to tune (and what not to)

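n_estimators: more is usually better until you run out of compute. Returns diminish past a few hundred trees.
max_features: sqrt of the feature count for classification and about one third of the features for regression are sane defaults.
max_depth, min_samples_leaf: control the complexity of individual trees. The defaults work surprisingly often.
Out-of-bag (OOB) score: a free estimate of generalization, because each tree is scored on the rows left out of its bootstrap sample. Turn it on with oob_score=True.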
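One way to see the diminishing returns of n_estimators, and to check how closely the OOB score tracks a held-out score, is to refit the forest at a few sizes. This is a rough sketch on synthetic data. The dataset, the grid of sizes, and the variable names are assumptions, and the numbers will differ on real data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data, used only so the sketch is self-contained.
X, y = make_classification(n_samples=1000, n_features=20, n_informative=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

results = {}
for n in (50, 100, 200, 400):
    forest = RandomForestClassifier(
        n_estimators=n, max_features="sqrt", oob_score=True, n_jobs=-1, random_state=0
    )
    forest.fit(X_train, y_train)
    # oob_score_ uses only training rows; score() uses the held-out split.
    results[n] = (forest.oob_score_, forest.score(X_test, y_test))
    print(f"n_estimators={n:>3}  OOB={results[n][0]:.3f}  test={results[n][1]:.3f}")
```

If the OOB column flattens out as the forest grows, adding more trees is mostly buying compute cost rather than accuracy.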

When to reach for it

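Random forest is the "just works" tabular baseline one step up from logistic regression. Use it when you want strong out-of-the-box performance with minimal tuning and can tolerate a somewhat larger memory footprint. At the top of leaderboards, gradient boosting usually wins by a small margin.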
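A quick way to place a random forest between those two neighbors is to cross-validate all three with near-default settings. The sketch below assumes a synthetic dataset and 5-fold accuracy. Which model comes out ahead depends on the data, so treat it as a template for your own comparison and not as evidence for a ranking.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic tabular data; replace with your own X, y.
X, y = make_classification(n_samples=800, n_features=20, n_informative=6, random_state=0)

models = {
    # Logistic regression benefits from scaling; the tree models do not need it.
    "logistic regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "random forest": RandomForestClassifier(n_estimators=200, n_jobs=-1, random_state=0),
    "gradient boosting": GradientBoostingClassifier(random_state=0),
}

cv_scores = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    cv_scores[name] = scores.mean()
    print(f"{name:<20} {scores.mean():.3f} +/- {scores.std():.3f}")
```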

Code

Random forest with OOB scoring and balanced class weights·python

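from sklearn.ensemble import RandomForestClassifier

# X_train, y_train: training features and labels, assumed to be defined earlier.
rf = RandomForestClassifier(
    n_estimators=400, max_depth=None, max_features="sqrt",
    class_weight="balanced",  # weight classes inversely to their frequency
    oob_score=True,           # score each row using only trees that did not train on it
    n_jobs=-1, random_state=7
)
rf.fit(X_train, y_train)
print("OOB score:", rf.oob_score_)  # accuracy on the out-of-bag rows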

Permutation importance is more honest than feature_importances_·python

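from sklearn.inspection import permutation_importance

# Assumes rf is the fitted forest from above and X_val is a pandas DataFrame (needed for .columns).
perm = permutation_importance(rf, X_val, y_val, n_repeats=20, random_state=7, n_jobs=-1)
# Print the ten features whose shuffling lowers the validation score the most.
for name, score in sorted(zip(X_val.columns, perm.importances_mean), key=lambda x: -x[1])[:10]:
    print(f"{score:+.4f}  {name}")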
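To see why the permutation view tends to be the more trustworthy one, it can help to append a column of pure noise and compare both importance measures. The sketch below is an illustration under assumptions: the synthetic data, the feature names f0 to f4, and the "noise" column are all made up for the example. The impurity-based feature_importances_ is computed from the training data, so it usually gives the noise column some nonzero credit, while permutation importance on held-out data usually leaves it close to zero.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Five synthetic features (three informative) plus one column of pure noise.
X, y = make_classification(
    n_samples=1000, n_features=5, n_informative=3, n_redundant=0, random_state=0
)
rng = np.random.default_rng(0)
X = np.hstack([X, rng.normal(size=(X.shape[0], 1))])
names = [f"f{i}" for i in range(5)] + ["noise"]  # hypothetical feature names

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)
forest = RandomForestClassifier(n_estimators=200, n_jobs=-1, random_state=0)
forest.fit(X_train, y_train)

# Permutation importance is measured on the validation split, not the training data.
perm = permutation_importance(forest, X_val, y_val, n_repeats=10, random_state=0, n_jobs=-1)

for name, impurity, permuted in zip(names, forest.feature_importances_, perm.importances_mean):
    print(f"{name:>6}  impurity={impurity:.4f}  permutation={permuted:+.4f}")
```

Compare the two columns on the "noise" row: any importance it receives is an artifact of the measure and not a real signal.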
