MLOps — Experiment Tracking, Versioning, Monitoring

The infrastructure that keeps a model alive in production

A model that just sits there after you ship it quietly degrades. MLOps is the practice of treating an ML model like any other production system: version it, monitor it, and retrain it when reality drifts.

The four pillars of MLOps

Experiment tracking: log every hyperparameter, metric, and artifact of every run. W&B, MLflow, and Comet are popular tools.
Model versioning: store the model together with its training data, code, and environment. Options include Git LFS, DVC, the MLflow Model Registry, or simply blob storage with a manifest file.
Production monitoring: track latency, throughput, and error rate, and also model behavior (prediction distribution, input distribution). Detect drift.
Retraining trigger: an automated pipeline that retrains when monitoring signals show degradation.
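
To make the "blob storage with a manifest file" option concrete, here is a minimal sketch of a manifest writer. The function names, the manifest fields, and the `.manifest.json` suffix are illustrative assumptions, not a standard format; adapt them to whatever your storage layout needs.

```python
import hashlib
import json
import platform
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file in chunks so large model files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_manifest(model_path: Path, dataset_version: str, code_version: str, metrics: dict) -> Path:
    """Write '<model file>.manifest.json' next to the model and return its path."""
    manifest = {
        "created_at": datetime.now(timezone.utc).isoformat(),
        "model_file": model_path.name,
        "model_sha256": sha256_of(model_path),
        "dataset_version": dataset_version,  # e.g. a DVC tag or a dataset hash
        "code_version": code_version,        # e.g. a git commit SHA
        "python_version": platform.python_version(),  # a minimal stand-in for "environment"
        "metrics": metrics,
    }
    manifest_path = model_path.with_name(model_path.name + ".manifest.json")
    manifest_path.write_text(json.dumps(manifest, indent=2))
    return manifest_path


if __name__ == "__main__":
    # Demo with a stand-in "model" file in a temporary directory.
    with tempfile.TemporaryDirectory() as tmp:
        model_file = Path(tmp) / "model.pt"
        model_file.write_bytes(b"fake weights")
        out = write_manifest(model_file, "v3", "abc1234", {"val_accuracy": 0.93})
        print(out.read_text())
```

Storing the model hash in the manifest lets you check later that a deployed file is the one the metrics were measured on.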

Minimum viable setup

An indie project does not need Kubernetes-managed Kubeflow. What it does need:

One experiment tracker (W&B or MLflow). Pick one and use it consistently.
Model files stored alongside a manifest (date, dataset version, metrics).
Production logging that captures inputs and outputs (sampled; do not log every request).
A weekly script that compares the production prediction distribution against the training distribution and alerts on a shift.

That's all. A big-company MLOps stack adds power, but it is overkill for a team of one to three people. Start simple and add pieces as you need them.
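
The weekly comparison script from the list above could look roughly like the sketch below. It uses the Population Stability Index (PSI) as the shift measure, which is one common choice and not something the text prescribes. The function names and the 0.25 alert threshold (a widely quoted rule of thumb) are assumptions, and the demo data is synthetic.

```python
import math
import random

ALERT_THRESHOLD = 0.25  # assumed rule-of-thumb cutoff; tune for your model


def histogram_fractions(values, edges):
    """Fraction of values per bin; out-of-range values are clamped into the end bins."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        idx = 0
        while idx < len(counts) - 1 and v >= edges[idx + 1]:
            idx += 1
        counts[idx] += 1
    return [c / len(values) for c in counts]


def psi(reference, current, n_bins=10, eps=1e-6):
    """Population Stability Index between two samples of a scalar prediction score."""
    if not reference or not current:
        raise ValueError("both samples must be non-empty")
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / n_bins
    edges = [lo + i * width for i in range(n_bins + 1)]
    ref = histogram_fractions(reference, edges)
    cur = histogram_fractions(current, edges)
    total = 0.0
    for r, c in zip(ref, cur):
        # Floor empty bins at eps so the logarithm stays finite.
        r, c = max(r, eps), max(c, eps)
        total += (c - r) * math.log(c / r)
    return total


if __name__ == "__main__":
    rng = random.Random(0)
    training = [rng.gauss(0.5, 0.1) for _ in range(5000)]
    stable_week = [rng.gauss(0.5, 0.1) for _ in range(5000)]
    shifted_week = [rng.gauss(0.65, 0.1) for _ in range(5000)]
    for name, week in [("stable", stable_week), ("shifted", shifted_week)]:
        score = psi(training, week)
        status = "ALERT" if score > ALERT_THRESHOLD else "ok"
        print(f"{name}: PSI={score:.3f} {status}")
```

In a real setup the `training` sample would come from predictions on held-out training data and the weekly sample from your production logs; the alert could be an email or a chat webhook instead of a print.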

Code

Weights & Biases: a popular tracker·python

# pip install wandb
import wandb

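# train_one_epoch, evaluate, and optimizer are placeholders for your own training code.
config = {
    "lr": 1e-3,
    "epochs": 50,
    "batch_size": 32,
    "architecture": "ResNet50",
}
wandb.init(project="pytorch-quest", config=config)

for epoch in range(config["epochs"]):   # reuse the logged value instead of a second hard-coded 50
    train_loss = train_one_epoch(...)
    val_loss, val_acc = evaluate(...)

    wandb.log({
        "epoch": epoch,
        "train_loss": train_loss,
        "val_loss": val_loss,
        "val_accuracy": val_acc,
        "learning_rate": optimizer.param_groups[0]["lr"],
    })

# Upload the checkpoint too; this assumes your training code already wrote best_model.pth.
wandb.save("best_model.pth")
wandb.finish()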

MLflow: an open-source alternative·python

# pip install mlflow
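import mlflow
import mlflow.pytorch   # explicit import of the PyTorch flavor used below

# model and train_one_epoch are placeholders for your own training code.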

mlflow.set_experiment("pytorch-quest")

with mlflow.start_run():
    mlflow.log_params({
        "lr": 1e-3,
        "epochs": 50,
        "model": "ResNet50",
    })

    for epoch in range(50):
        train_loss = train_one_epoch(...)
        mlflow.log_metric("train_loss", train_loss, step=epoch)

    mlflow.pytorch.log_model(model, "model")
    # To browse runs, run `mlflow ui` in a shell and open http://127.0.0.1:5000

Production monitoring — sampled request logging·python

import json
import random
import time
from pathlib import Path

LOG_DIR = Path("/var/log/pippa-pred")
LOG_DIR.mkdir(parents=True, exist_ok=True)

def log_prediction(input_data, prediction, sample_rate=0.01):
    """Log roughly sample_rate (default 1%) of predictions for offline analysis."""
    if random.random() >= sample_rate:
        return
    record = {
        'ts': time.time(),
        # summarize_input is a placeholder you supply; avoid logging raw inputs if they may contain PII
        'input_summary': summarize_input(input_data),
        'prediction': prediction,     # must be JSON-serializable
    }
    fname = LOG_DIR / f"{time.strftime('%Y%m%d')}.jsonl"
    with open(fname, 'a') as f:
        f.write(json.dumps(record) + '\n')

# In your serving code:
# log_prediction(req, response.dict())
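
For the offline analysis side, a small reader for these daily `.jsonl` files might look like the sketch below. `load_records` and `label_counts` are hypothetical helper names, and the `label` field inside each prediction is an assumption about what your serving code returns.

```python
import json
import tempfile
from collections import Counter
from pathlib import Path


def load_records(log_dir):
    """Yield every JSON record from the *.jsonl files in log_dir, oldest file first."""
    # YYYYMMDD file names sort chronologically when sorted as strings.
    for path in sorted(Path(log_dir).glob("*.jsonl")):
        with open(path) as f:
            for line in f:
                line = line.strip()
                if not line:
                    continue
                try:
                    yield json.loads(line)
                except json.JSONDecodeError:
                    # A crash mid-write can leave a truncated line; skip it.
                    continue


def label_counts(records, key="label"):
    """Count predicted labels; `key` is an assumed field inside each prediction dict."""
    return Counter(
        r["prediction"].get(key)
        for r in records
        if isinstance(r.get("prediction"), dict)
    )


if __name__ == "__main__":
    # Demo on a temporary directory with two good lines and one truncated line.
    with tempfile.TemporaryDirectory() as tmp:
        sample = Path(tmp) / "20240101.jsonl"
        sample.write_text(
            '{"ts": 1.0, "input_summary": {}, "prediction": {"label": "cat"}}\n'
            '{"ts": 2.0, "input_summary": {}, "prediction": {"label": "dog"}}\n'
            '{"ts": 3.0, "input_summ'
        )
        print(label_counts(load_records(tmp)))
```

Comparing these counts week over week is a cheap first look at whether the prediction distribution is moving.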

CI/CD shape: a GitHub Actions example·yaml

# .github/workflows/ml-pipeline.yml
name: ML Pipeline

on:
  push:
    branches: [main]
  schedule:
    - cron: '0 6 * * 0'       # weekly retrain on Sundays

jobs:
  train-evaluate-deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Setup Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.12'
      - name: Install
        run: pip install -r requirements.txt
      - name: Tests
        run: pytest tests/
      - name: Train
        run: python train.py --config config.yaml
      - name: Evaluate
        run: python evaluate.py --threshold 0.90
      - name: Upload artifact
        uses: actions/upload-artifact@v4
        with:
          name: model
          path: outputs/model.pt
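
The Evaluate step only protects you if the script fails the job when the model is too weak. One way `evaluate.py` could do that is sketched below, under assumptions: the `--metrics-file` and `--metric` flags, the `outputs/metrics.json` path, and the `val_accuracy` name are all hypothetical, and a real script would compute the metrics rather than read them from a file.

```python
import argparse
import json
import tempfile
from pathlib import Path


def passes_gate(metrics: dict, metric_name: str, threshold: float) -> bool:
    """True when the named metric is present and at least the threshold."""
    value = metrics.get(metric_name)
    return value is not None and value >= threshold


def main(argv=None) -> int:
    """Return 0 when the gate passes and 1 otherwise, for use as a process exit code."""
    parser = argparse.ArgumentParser(description="Fail the build if the model is below threshold.")
    parser.add_argument("--threshold", type=float, required=True)
    parser.add_argument("--metrics-file", type=Path, default=Path("outputs/metrics.json"))  # hypothetical path
    parser.add_argument("--metric", default="val_accuracy")  # hypothetical metric name
    args = parser.parse_args(argv)

    metrics = json.loads(args.metrics_file.read_text())
    ok = passes_gate(metrics, args.metric, args.threshold)
    result = "PASS" if ok else "FAIL"
    print(f"{args.metric}={metrics.get(args.metric)} threshold={args.threshold} -> {result}")
    return 0 if ok else 1


if __name__ == "__main__":
    # Demo: write a stand-in metrics file and run the gate against two thresholds.
    with tempfile.TemporaryDirectory() as tmp:
        metrics_file = Path(tmp) / "metrics.json"
        metrics_file.write_text(json.dumps({"val_accuracy": 0.93}))
        print("exit code:", main(["--threshold", "0.90", "--metrics-file", str(metrics_file)]))
        print("exit code:", main(["--threshold", "0.95", "--metrics-file", str(metrics_file)]))
```

In the actual script you would replace the demo with `raise SystemExit(main())`. GitHub Actions treats a non-zero exit code as a failed step, so later steps in the job are skipped by default.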
