Clustering

~28 min · clustering, kmeans, dbscan

What clustering is for

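Clustering groups examples that look similar under a chosen distance. Used well, it produces customer segments, anomaly candidates, and compact features for downstream models. Used badly, it produces pretty charts fitted to things the business doesn't care about.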
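The "compact features" idea can be sketched as follows. This is a minimal illustration, not a recommended pipeline: the synthetic data from make_blobs, the choice of 4 clusters, and the variable names are all assumptions made for the example.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a real feature matrix (assumption: 500 rows, 6 features).
X_demo, _ = make_blobs(n_samples=500, n_features=6, centers=4, random_state=7)
Xs_demo = StandardScaler().fit_transform(X_demo)

km_demo = KMeans(n_clusters=4, n_init=10, random_state=7).fit(Xs_demo)

# transform() returns the Euclidean distance from every row to every centroid,
# so 6 raw columns become 4 "distance to segment" columns.
dist_features = km_demo.transform(Xs_demo)

print(dist_features.shape)
print(dist_features[:3].round(2))
```

The distance columns can be appended to, or used in place of, the raw features of a downstream model. Whether they help is an empirical question for the task at hand.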

Three algorithms for three intuitions

K-Means: assumes K convex blobs of similar size. Fast, and scales to millions of rows.
DBSCAN / HDBSCAN: density-based; finds arbitrarily shaped clusters, handles noise, and leaves "unclustered" rows unassigned.
Agglomerative: bottom-up hierarchy. Good for small datasets where interpretability matters.
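To make the three intuitions concrete, the sketch below runs all three families on the two-moons toy dataset, where the true groups are not convex. The dataset, the DBSCAN settings (eps=0.3, min_samples=5), and the Ward linkage are assumptions picked for illustration; scikit-learn's DBSCAN stands in for the density-based family here.

```python
from sklearn.cluster import DBSCAN, AgglomerativeClustering, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

# Two interleaved half-circles: a shape that breaks the "convex blob" assumption.
X_moons, y_true = make_moons(n_samples=600, noise=0.05, random_state=7)
Xs_moons = StandardScaler().fit_transform(X_moons)

models = {
    "kmeans": KMeans(n_clusters=2, n_init=10, random_state=7),
    "dbscan": DBSCAN(eps=0.3, min_samples=5),
    "agglomerative": AgglomerativeClustering(n_clusters=2, linkage="ward"),
}

scores = {}
for name, model in models.items():
    labels = model.fit_predict(Xs_moons)
    # Adjusted Rand index: 1.0 means the labels match the true moons exactly.
    scores[name] = adjusted_rand_score(y_true, labels)
    print(f"{name:>14}  ARI={scores[name]:.2f}")
```

On this shape the density-based method should follow the curved groups, while K-Means tends to cut straight across them. On compact, well-separated blobs the ranking can easily reverse, so the comparison says more about the data's geometry than about which algorithm is "better".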

Choosing K (or letting K choose itself)

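For K-Means, plot inertia against K (the elbow plot) alongside the silhouette score, and pick the K where the elbow bends. For HDBSCAN, set min_cluster_size from what plausibly matters to the business ("a real segment has at least 200 customers"). Always validate clusters by inspecting examples; don't trust the scores alone.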
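As a rough sketch of reading these diagnostics programmatically, the example below builds four well-separated synthetic blobs (an assumption, so the right answer is known) and checks which K the silhouette score prefers and where the inertia drop flattens. Taking the argmax of the silhouette is only a heuristic starting point, not a substitute for inspecting the clusters.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Four blobs at the corners of a square, so K=4 is the known answer (toy data).
corners = [[-5, -5], [-5, 5], [5, -5], [5, 5]]
X_blobs, _ = make_blobs(n_samples=800, centers=corners, cluster_std=1.0, random_state=7)
Xs_blobs = StandardScaler().fit_transform(X_blobs)

ks = list(range(2, 9))
inertias, silhouettes = [], []
for k in ks:
    km_k = KMeans(n_clusters=k, n_init=10, random_state=7).fit(Xs_blobs)
    inertias.append(km_k.inertia_)
    silhouettes.append(silhouette_score(Xs_blobs, km_k.labels_))

# Relative inertia drop when moving from k-1 to k; the elbow is where this collapses.
drops = [1 - inertias[i] / inertias[i - 1] for i in range(1, len(ks))]
for k, drop in zip(ks[1:], drops):
    print(f"k={k}  relative inertia drop={drop:.2f}")

best_k = ks[int(np.argmax(silhouettes))]
print("silhouette prefers k =", best_k)
```

On real customer data the elbow is rarely this clean, and the two diagnostics can disagree, which is one reason the scores should be checked against sampled examples.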

Code

K-Means with elbow and silhouette diagnostics · python

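from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# X: numeric feature matrix of shape (n_samples, n_features), assumed already loaded.
# Standardize first so no single feature dominates the Euclidean distance.
Xs = StandardScaler().fit_transform(X)
for k in range(2, 11):
    km = KMeans(n_clusters=k, n_init="auto", random_state=7).fit(Xs)
    # silhouette_score computes pairwise distances; on large data, pass sample_size=...
    sil = silhouette_score(Xs, km.labels_)
    print(f"k={k}  inertia={km.inertia_:,.0f}  silhouette={sil:.3f}")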

HDBSCAN for density-based segmentation with noise · python

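import hdbscan

# Xs: the standardized feature matrix from the K-Means example.
# min_cluster_size encodes the smallest segment worth acting on.
clusterer = hdbscan.HDBSCAN(min_cluster_size=200, min_samples=5)
labels = clusterer.fit_predict(Xs)
# Label -1 marks noise rows, so it is excluded from the cluster count.
print("clusters:", len(set(labels)) - (1 if -1 in labels else 0))
print("noise:", (labels == -1).mean())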

Exercise

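Cluster a dataset with both K-Means and HDBSCAN. Sample 10 customers from each cluster and write a one-sentence persona for each cluster. Decide whether the clusters answer a real business question or are just artifacts of the features.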
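One possible way to draw the per-cluster samples is sketched below. The helper name sample_per_cluster and the toy label array are hypothetical, introduced only for this illustration; in the exercise the labels would come from km.labels_ or the HDBSCAN output.

```python
import numpy as np

def sample_per_cluster(labels, n=10, seed=7):
    """Return {cluster_label: row indices}, with up to n random rows per cluster.

    Rows labeled -1 (DBSCAN/HDBSCAN noise) are skipped.
    """
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    samples = {}
    for cluster in np.unique(labels):
        if cluster == -1:
            continue
        idx = np.flatnonzero(labels == cluster)
        # Small clusters may hold fewer than n rows, so cap the sample size.
        take = min(n, idx.size)
        samples[int(cluster)] = rng.choice(idx, size=take, replace=False)
    return samples

# Toy labels: 25 rows in cluster 0, 4 rows in cluster 1, 6 noise rows.
demo_labels = np.array([0] * 25 + [1] * 4 + [-1] * 6)
picked = sample_per_cluster(demo_labels)
for cluster, idx in picked.items():
    print(cluster, sorted(idx.tolist()))
```

The returned indices can be used to look up the original, unscaled rows, which are usually easier to read when writing a persona than standardized values.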
