Lesson 06 of 06 · published

MLOps — Experiment Tracking, Versioning, Monitoring

~12 min · mlops, wandb, mlflow, monitoring


The infrastructure that keeps a model alive in production

A model that just sits there after shipping degrades silently. MLOps is the practice of treating ML models like any other production system: version them, monitor them, and retrain when reality drifts.

The four pillars of MLOps

  1. Experiment tracking: log every hyperparameter, metric, and artifact of every run. W&B, MLflow, and Comet are popular tools.
  2. Model versioning: store the model together with its training data, code, and environment. Options include Git LFS, DVC, MLflow Model Registry, or simply blob storage with a manifest file.
  3. Production monitoring: track latency, throughput, and error rate, and also model behavior (prediction distribution, input distribution). Detect drift.
  4. Retraining trigger: an automated pipeline that retrains when monitoring signals show degradation.
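
As a rough illustration of the fourth pillar, the sketch below decides whether to retrain from a list of weekly accuracy readings. The names `RetrainPolicy` and `should_retrain`, the 0.03 tolerance, and the two-window patience are all hypothetical choices for this example, and it assumes delayed ground-truth labels are available to compute production accuracy; a drift score could be substituted if they are not.

```python
from dataclasses import dataclass


@dataclass
class RetrainPolicy:
    baseline_accuracy: float    # accuracy measured when the model was deployed
    max_drop: float = 0.03      # tolerated absolute drop before a window counts as degraded
    patience: int = 2           # consecutive degraded windows needed to trigger


def should_retrain(weekly_accuracy: list[float], policy: RetrainPolicy) -> bool:
    """Return True when the most recent `patience` windows are all degraded."""
    if len(weekly_accuracy) < policy.patience:
        return False            # not enough evidence yet
    floor = policy.baseline_accuracy - policy.max_drop
    recent = weekly_accuracy[-policy.patience:]
    return all(acc < floor for acc in recent)


policy = RetrainPolicy(baseline_accuracy=0.92)
stable = should_retrain([0.92, 0.91, 0.90], policy)      # no recent window under the floor
degraded = should_retrain([0.91, 0.88, 0.87], policy)    # last two windows under the floor
print(stable, degraded)
```

Requiring several consecutive degraded windows is one way to avoid retraining on a single noisy week; the right patience depends on how quickly the data actually shifts.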

Minimum viable setup

An indie project does not need Kubeflow running on Kubernetes. What you need:

  • One experiment tracker (W&B or MLflow). Pick one and use it consistently.
  • Model files saved together with a manifest (date, dataset version, metrics).
  • Production logging that captures inputs and outputs (sampled; do not log every request).
  • A weekly script that compares the production prediction distribution with the training distribution and alerts on shift.

That's all. A big-company MLOps stack adds power, but it is overkill for a team of one to three people. Start simple and add pieces as you need them.
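
The manifest item in the list above can be as small as a JSON file written next to the model. The following is a minimal sketch using only the standard library; the function name `write_manifest`, the `.manifest.json` suffix, and the field names are assumptions made for illustration, not a standard format.

```python
import hashlib
import json
import tempfile
from datetime import date
from pathlib import Path


def write_manifest(model_path: Path, dataset_version: str, metrics: dict) -> Path:
    """Write <model file>.manifest.json next to the model file and return its path."""
    digest = hashlib.sha256(model_path.read_bytes()).hexdigest()
    manifest = {
        "model_file": model_path.name,
        "sha256": digest,                       # ties the manifest to these exact weights
        "date": date.today().isoformat(),
        "dataset_version": dataset_version,
        "metrics": metrics,
    }
    manifest_path = model_path.parent / (model_path.name + ".manifest.json")
    manifest_path.write_text(json.dumps(manifest, indent=2))
    return manifest_path


# Demo with a stand-in file; in practice model_path points at your saved checkpoint.
with tempfile.TemporaryDirectory() as tmp:
    demo_model = Path(tmp) / "model.pt"
    demo_model.write_bytes(b"fake weights for the demo")
    demo_manifest = write_manifest(demo_model, "v3", {"val_accuracy": 0.91})
    print(demo_manifest.read_text())
```

Storing a hash of the file lets you check later that a model in blob storage is the one the manifest describes.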

Code

Weights & Biases: a popular tracker·python
# pip install wandb
import wandb

wandb.init(project="pytorch-quest", config={
    "lr": 1e-3,
    "epochs": 50,
    "batch_size": 32,
    "architecture": "ResNet50",
})

# train_one_epoch, evaluate, and optimizer come from your own training script
for epoch in range(wandb.config.epochs):
    train_loss = train_one_epoch(...)
    val_loss, val_acc = evaluate(...)

    wandb.log({
        "epoch": epoch,
        "train_loss": train_loss,
        "val_loss": val_loss,
        "val_accuracy": val_acc,
        "learning_rate": optimizer.param_groups[0]['lr'],
    })

wandb.save("best_model.pth")    # ship the model artifact too
wandb.finish()
MLflow: the open-source alternative·python
# pip install mlflow
import mlflow
import mlflow.pytorch    # PyTorch flavor, used by log_model below

mlflow.set_experiment("pytorch-quest")

with mlflow.start_run():
    mlflow.log_params({
        "lr": 1e-3,
        "epochs": 50,
        "model": "ResNet50",
    })

    for epoch in range(50):
        train_loss = train_one_epoch(...)
        mlflow.log_metric("train_loss", train_loss, step=epoch)

    mlflow.pytorch.log_model(model, "model")    # model is your trained torch.nn.Module
    # Run `mlflow ui` in a shell, then open http://127.0.0.1:5000 to browse runs
Production monitoring — sampled request logging·python
import json
import random
import time
from pathlib import Path

LOG_DIR = Path("/var/log/pippa-pred")
LOG_DIR.mkdir(parents=True, exist_ok=True)

def log_prediction(input_data, prediction, sample_rate=0.01):
    """Log roughly 1% of predictions for offline analysis."""
    if random.random() >= sample_rate:    # >= so that sample_rate=0 never logs
        return
    record = {
        'ts': time.time(),
        'input_summary': summarize_input(input_data),     # your own helper; avoid logging raw inputs if they contain PII
        'prediction': prediction,
    }
    fname = LOG_DIR / f"{time.strftime('%Y%m%d')}.jsonl"
    with open(fname, 'a') as f:
        f.write(json.dumps(record) + '\n')

# In your serving code:
# log_prediction(req, response.dict())
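
The weekly comparison of production predictions against the training distribution could be sketched as below. This is one possible approach, not the only one: it uses the Population Stability Index (PSI) over predicted class frequencies. The helper names, the `prediction['label']` field, the example classes, and the 0.2 alert threshold (a commonly quoted rule of thumb) are all assumptions to adapt to your own log format and data.

```python
import json
import math
from collections import Counter
from pathlib import Path


def class_distribution(labels, classes):
    """Fraction of samples per class, in the order given by `classes`."""
    counts = Counter(labels)
    total = max(len(labels), 1)
    return [counts[c] / total for c in classes]


def psi(expected, actual, eps=1e-6):
    """Population Stability Index between two discrete distributions."""
    score = 0.0
    for e, a in zip(expected, actual):
        e, a = max(e, eps), max(a, eps)     # avoid log(0) for empty classes
        score += (a - e) * math.log(a / e)
    return score


def load_logged_labels(log_file: Path):
    """Read labels from a JSONL log; assumes each record has prediction['label']."""
    labels = []
    with open(log_file) as f:
        for line in f:
            labels.append(json.loads(line)["prediction"]["label"])
    return labels


classes = ["cat", "dog", "bird"]
train_dist = [0.5, 0.3, 0.2]        # measured once on the training set
# Stand-in for load_logged_labels(Path(...)) over the past week's log files.
prod_labels = ["cat"] * 20 + ["dog"] * 30 + ["bird"] * 50
score = psi(train_dist, class_distribution(prod_labels, classes))
if score > 0.2:                     # assumed threshold; tune for your data
    print(f"ALERT: prediction drift, PSI={score:.3f}")
```

The same `psi` function works for binned numeric inputs, so one script can cover both prediction drift and input drift.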
CI/CD shape: GitHub Actions example·yaml
# .github/workflows/ml-pipeline.yml
name: ML Pipeline

on:
  push:
    branches: [main]
  schedule:
    - cron: '0 6 * * 0'       # weekly retrain on Sundays

jobs:
  train-evaluate-deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Setup Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.12'
      - name: Install
        run: pip install -r requirements.txt
      - name: Tests
        run: pytest tests/
      - name: Train
        run: python train.py --config config.yaml
      - name: Evaluate
        run: python evaluate.py --threshold 0.90
      - name: Upload artifact
        uses: actions/upload-artifact@v4
        with:
          name: model
          path: outputs/model.pt
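
The `Evaluate` step in the workflow above only acts as a gate if `evaluate.py` exits with a non-zero code when the model falls short. The source does not show that script, so the following is a hedged sketch of what it might contain. The `--metrics` flag, the `outputs/metrics.json` path, and the `val_accuracy` key are hypothetical; it assumes the training step writes its metrics to that JSON file.

```python
import argparse
import json
from pathlib import Path


def passes_gate(metrics: dict, threshold: float, key: str = "val_accuracy") -> bool:
    """True when the tracked metric meets or exceeds the threshold."""
    return metrics[key] >= threshold


def main(argv=None) -> int:
    parser = argparse.ArgumentParser(description="Fail the CI job if the model is not good enough.")
    parser.add_argument("--threshold", type=float, required=True)
    parser.add_argument("--metrics", type=Path, default=Path("outputs/metrics.json"))
    args = parser.parse_args(argv)

    metrics = json.loads(args.metrics.read_text())
    value = metrics["val_accuracy"]
    if passes_gate(metrics, args.threshold):
        print(f"OK: val_accuracy={value:.3f} >= {args.threshold}")
        return 0
    print(f"FAIL: val_accuracy={value:.3f} < {args.threshold}")
    return 1    # a non-zero exit code stops the job before the upload step


# In evaluate.py, finish with the usual entry point:
#     if __name__ == "__main__":
#         raise SystemExit(main())
```

Because GitHub Actions stops a job at the first failing step by default, returning 1 here keeps a weak model from reaching the artifact upload.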

Exercise

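Pick W&B or MLflow and add it to any training script you have written. Run a sweep over three different learning rates, then look at the results in the dashboard or UI: a visual diff of runs is far easier to read than scrolling through console output. That habit alone justifies the five-minute setup.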
