Model management endpoints

Programmatic control of your local fleet

Everything the ollama CLI does is also available over HTTP. That matters when you run Ollama on a remote machine, wire it into a deploy pipeline, or build a UI on top of it.

What each endpoint does

GET /api/tags: lists installed models (size, modification time).
POST /api/show: full metadata for one model, including template, parameters, and capabilities.
POST /api/pull: downloads a model. Streams NDJSON progress events.
DELETE /api/delete: removes a model.
POST /api/embed: generates embeddings (single or batch).
POST /api/copy: aliases or renames a model (cheap, the blobs are shared).
GET /api/ps: currently loaded models (with VRAM usage and TTL).
GET /api/version: daemon version (handy as a health check).
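The list above says /api/copy can rename a model, but a copy alone leaves the old name in place. Here is a minimal sketch of a rename as two requests, assuming /api/copy takes `source` and `destination` and /api/delete takes `model` in the JSON body. The helper `rename_plan` and the alias `prod-chat` are made up for illustration, and the function only builds the requests without sending anything.

```python
from typing import Any

# (HTTP method, path, JSON body). Hypothetical shape used only in this sketch.
Request = tuple[str, str, dict[str, Any]]


def rename_plan(old: str, new: str) -> list[Request]:
    """Build the requests that rename a model: copy to the new name, then delete the old one."""
    if old == new:
        raise ValueError("old and new names must differ")
    return [
        # The copy is cheap because both names point at the same blobs.
        ("POST", "/api/copy", {"source": old, "destination": new}),
        # Deleting the old name afterwards turns the alias into a rename.
        ("DELETE", "/api/delete", {"model": old}),
    ]


for method, path, body in rename_plan("qwen2.5:7b", "prod-chat"):
    print(method, path, body)
```

To actually send them, something like `httpx.request(method, f"{OLLAMA}{path}", json=body).raise_for_status()` should work. `httpx.request` is used instead of `httpx.delete` because `httpx.delete` does not accept a `json` argument.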

Pull progress is streaming NDJSON

Most management endpoints are non-streaming, but /api/pull emits NDJSON progress lines: first {"status": "pulling manifest"}, then lines like {"status": "pulling abc123", "completed": 1234567, "total": 5000000000}. Read completed / total from each line to draw a progress bar.
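A model usually arrives as several layers, and each one reports its own completed / total, so a per-line percentage restarts with every layer. The sketch below keeps a running total across layers instead. It assumes each layer shows up under its own status string (as in the "pulling abc123" example) and that a failure arrives as a line with an "error" key. The function name `overall_progress` is made up for illustration.

```python
import json
from typing import Iterable, Iterator


def overall_progress(lines: Iterable[str]) -> Iterator[tuple[str, float]]:
    """Yield (status, overall percent) for each NDJSON progress line."""
    layers: dict[str, tuple[int, int]] = {}  # status -> (completed, total)
    for line in lines:
        if not line.strip():
            continue  # skip blank keep-alive lines
        evt = json.loads(line)
        if "error" in evt:  # assumption: errors are reported in-stream like this
            raise RuntimeError(evt["error"])
        status = evt.get("status", "")
        if "total" in evt:
            # Remember the latest numbers for this layer.
            layers[status] = (evt.get("completed", 0), evt["total"])
        done = sum(c for c, _ in layers.values())
        total = sum(t for _, t in layers.values())
        yield status, (100 * done / total if total else 0.0)
```

It accepts any iterable of lines, so `r.iter_lines()` from an `httpx.stream(...)` response can be passed in directly. One caveat: the total only covers layers seen so far, so the percentage can drop when a new layer starts.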

Loading and unloading through the API

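There is no dedicated load/unload endpoint. The pattern: send /api/generate a request with an empty prompt and a keep_alive value, and the model is forced to load. Send keep_alive: 0 to the same endpoint and it unloads. This is how warmup scripts work.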

Code

Forcing load/unload programmatically·python

import httpx

OLLAMA = "http://localhost:11434"

def load(model: str, keep_alive: str = "30m"):
    """Force the model into memory (no prompt, so nothing is generated)."""
    httpx.post(f"{OLLAMA}/api/generate",
               json={"model": model, "keep_alive": keep_alive},
               timeout=120.0).raise_for_status()

def unload(model: str):
    """Unload the model from memory immediately."""
    httpx.post(f"{OLLAMA}/api/generate",
               json={"model": model, "keep_alive": 0},
               timeout=10.0).raise_for_status()

def loaded() -> list[dict]:
    """Return the models that are loaded right now."""
    r = httpx.get(f"{OLLAMA}/api/ps", timeout=10.0)
    r.raise_for_status()
    return r.json().get("models", [])

# Warmup pattern
load("qwen2.5:7b")
print(loaded())   # qwen2.5:7b should show up, along with size_vram and expires_at
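The raw /api/ps entries are hard to scan, so here is a rough sketch that turns them into one line per model. It takes the list that `loaded()` returns. It assumes each entry has `name`, `size_vram`, and `expires_at`, plus a `size` field holding the total memory footprint, so that `size_vram / size` approximates the share sitting on the GPU. The helper name `summarize_ps` and the sample entry are made up for illustration.

```python
def summarize_ps(models: list[dict]) -> list[str]:
    """Format /api/ps entries as 'name: X GB VRAM (Y% on GPU), expires T'."""
    rows = []
    for m in models:
        size = m.get("size", 0)        # assumption: total footprint in bytes
        vram = m.get("size_vram", 0)   # bytes resident in VRAM
        gpu_pct = 100 * vram / size if size else 0.0
        rows.append(
            f"{m.get('name', '?')}: {vram / 1e9:.1f} GB VRAM "
            f"({gpu_pct:.0f}% on GPU), expires {m.get('expires_at', '?')}"
        )
    return rows


# Illustrative entry, not real output from a server.
sample = [{"name": "qwen2.5:7b", "size": 6_000_000_000,
           "size_vram": 3_000_000_000, "expires_at": "2025-01-01T00:30:00Z"}]
print("\n".join(summarize_ps(sample)))
```

A GPU share well below 100% likely means part of the model spilled over to CPU memory, which is worth checking before blaming slow generation on something else.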

Pull with progress·python

import json

import httpx

OLLAMA = "http://localhost:11434"

with httpx.stream("POST", f"{OLLAMA}/api/pull",
                  json={"model": "gemma3:12b"},
                  timeout=None) as r:
    r.raise_for_status()  # fail early on an HTTP error status
    last_status = ""
    for line in r.iter_lines():
        if not line:
            continue
        evt = json.loads(line)
        status = evt.get("status", "")
        if "total" in evt and "completed" in evt:
            pct = 100 * evt["completed"] / max(evt["total"], 1)
            print(f"\r{status}: {pct:5.1f}%", end="", flush=True)
        elif status != last_status:
            print(f"\n{status}", end="")
            last_status = status
