Classical ML, Deep Learning, LLMs, and RAG

A map of the approaches

Classical ML (linear models, trees, gradient boosting) dominates tabular data and most production scoring systems. Deep learning dominates perception (vision, audio, text), where features are hard to hand-craft. LLMs excel at instruction-following over unstructured text. RAG attaches a search index to an LLM so that answers come from your data rather than from the LLM's frozen weights.
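To make the RAG idea concrete, here is a minimal sketch of the retrieve-then-prompt step using only the standard library. The corpus, the bag-of-words cosine scoring, and the names `retrieve` and `build_prompt` are all illustrative assumptions. A real system would typically use a proper search index or embeddings, and would send the resulting prompt to an LLM.

```python
import math
import re
from collections import Counter

# Hypothetical mini-corpus standing in for "your data".
CORPUS = [
    "Refunds are issued within 14 days of the return being received.",
    "Standard shipping takes 3 to 5 business days.",
    "Support is available by email from 9am to 6pm on weekdays.",
]


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two token-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question, corpus, k=1):
    """Return the k passages most similar to the question."""
    q = Counter(tokenize(question))
    ranked = sorted(
        corpus,
        key=lambda doc: cosine(q, Counter(tokenize(doc))),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(question, passages):
    """Place retrieved passages ahead of the question to ground the answer."""
    context = "\n".join(f"- {p}" for p in passages)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


question = "How long does shipping take?"
passages = retrieve(question, CORPUS, k=1)
prompt = build_prompt(question, passages)  # this string would go to the LLM
```

The point of the design is that the facts live in the retrieved passages, so updating the corpus changes the answers without touching the model.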

Who still wins on tabular data

For most enterprise scoring, ranking, and risk problems, gradient boosted trees (xgboost, lightgbm, catboost) beat neural networks with less compute and far less plumbing. Reach for deep learning when the input is naturally a tensor (image, waveform, sequence) or when you need embeddings to feed a downstream classical model.

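When to use an LLM

LLMs shine on tasks with no labeled data, tasks where writing an instruction is easier than writing code, and tasks where 95% accuracy at zero training cost beats 98% accuracy after months of labeling. They are the wrong tool for high-precision scoring over structured data, and the wrong tool for any problem where a hallucinated answer is expensive.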
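The 95% versus 98% trade-off can be sketched as a simple cost comparison. Every number below (item volume, labeling cost, cost per error) is hypothetical, and the model ignores per-call inference cost. It only shows how the cheaper option flips as the cost of a wrong answer grows.

```python
def total_cost(accuracy, n_items, cost_per_error, upfront_cost):
    """Upfront cost plus the expected cost of wrong predictions."""
    return upfront_cost + (1.0 - accuracy) * n_items * cost_per_error


# Hypothetical figures, chosen only for illustration.
N_ITEMS = 100_000          # items to process
LABELING_COST = 20_000.0   # months of labeling and training, in currency units


def cheaper_option(cost_per_error):
    """Compare a zero-training-cost LLM at 95% with a trained model at 98%."""
    llm = total_cost(0.95, N_ITEMS, cost_per_error, upfront_cost=0.0)
    trained = total_cost(0.98, N_ITEMS, cost_per_error, upfront_cost=LABELING_COST)
    return "llm" if llm <= trained else "trained"


print(cheaper_option(0.50))   # cheap mistakes
print(cheaper_option(10.0))   # expensive mistakes
```

Under these assumptions the break-even sits where the extra 3 points of accuracy pay back the labeling bill, at roughly 6.67 per error. Below that the LLM is cheaper, above it the trained model is.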

Code

Picking a tool by the shape of the data row·python

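def pick_approach(row):
    """Map one problem description (`row`) to a starting approach."""
    # Enough labeled tabular data: gradient boosted trees are the default.
    if row.input_kind == "tabular" and row.label_count > 1000:
        return "gradient_boosted_trees"
    # Perception-style inputs where hand-crafted features are hard.
    if row.input_kind in {"image", "audio", "raw_text"}:
        return "deep_learning"
    # Unstructured text with few or no labels: prompt an LLM.
    if row.input_kind == "unstructured_text" and row.labels_scarce:
        return "llm_few_shot"
    # Answers must come from your own corpus: add retrieval.
    if row.needs_facts_from_corpus:
        return "rag"
    # Nothing matched: the problem statement needs sharpening first.
    return "clarify_problem"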

A classical baseline before reaching for an LLM·python

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

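# X_text_train (raw strings) and y_train (labels) are assumed to exist already.
tfidf_lr = Pipeline([
    # Unigram and bigram TF-IDF features; ignore terms seen in fewer than 5 documents.
    ("tfidf", TfidfVectorizer(min_df=5, ngram_range=(1, 2))),
    # Linear classifier; max_iter raised to give the solver room to converge.
    ("clf", LogisticRegression(max_iter=1000)),
]).fit(X_text_train, y_train)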
