Shipping a local AI fallback

The shape of this final lesson

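This lesson ties the quest together by shipping one real thing: a small backend service that picks a cloud provider and answers user prompts, with Ollama as a transparent fallback when the cloud is unreachable. It is the smallest production-shaped proof of the patterns from every earlier track.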

Production checklist

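Adapter pattern. The cloud adapter and the Ollama adapter both implement stream().
Health check. Both expose a health_check() with a short timeout.
Orchestrator. Try the cloud first; fall back to Ollama on failure or degraded health.
Streaming. Whichever vessel answers, the frontend sees one uninterrupted SSE stream.
Pinned model tags. Name an explicit model version for both Ollama and the cloud.
Observability. Log which vessel answered on every request. You will want this later, when some answers start to feel different.
Mini mode. If the local model is small, send it a stripped-down system prompt.
Warm-up. A startup task pre-loads the local model with keep_alive: -1 so the first fallback has acceptable latency.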
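
The first three checklist items can be sketched together. This is a minimal illustration, not the AIOrchestrator from the earlier lessons: Chunk, FakeVessel, FallbackOrchestrator, and collect are hypothetical names, and the fake vessel stands in for a real cloud or Ollama adapter. It assumes vessels are tried in list order and that a vessel is skipped when its health check fails or times out.

```python
import asyncio
from dataclasses import dataclass
from typing import AsyncIterator, Protocol


@dataclass
class Chunk:
    content: str
    model: str          # which vessel produced this chunk
    done: bool = False


class Vessel(Protocol):
    """The narrow adapter surface: a name, a health check, a stream."""
    name: str

    async def health_check(self) -> bool: ...
    def stream(self, messages: list[dict]) -> AsyncIterator[Chunk]: ...


class FakeVessel:
    """Hypothetical stand-in; a real adapter would call the cloud API or Ollama."""

    def __init__(self, name: str, healthy: bool = True):
        self.name = name
        self.healthy = healthy

    async def health_check(self) -> bool:
        return self.healthy

    async def stream(self, messages: list[dict]) -> AsyncIterator[Chunk]:
        if not self.healthy:
            raise ConnectionError(f"{self.name} unreachable")
        for word in ("hello", "world"):
            yield Chunk(content=word, model=self.name)
        yield Chunk(content="", model=self.name, done=True)


class FallbackOrchestrator:
    def __init__(self, vessels: list[Vessel], health_timeout: float = 1.0):
        self.vessels = vessels
        self.health_timeout = health_timeout

    async def _healthy(self, vessel: Vessel) -> bool:
        # A slow or failing health check counts as unhealthy.
        try:
            return await asyncio.wait_for(vessel.health_check(), self.health_timeout)
        except Exception:
            return False

    async def stream(self, messages: list[dict]) -> AsyncIterator[Chunk]:
        for vessel in self.vessels:
            if not await self._healthy(vessel):
                continue
            emitted = False
            try:
                async for chunk in vessel.stream(messages):
                    emitted = True
                    yield chunk
                return
            except Exception:
                if emitted:
                    raise  # failed mid-answer: do not splice two answers together
                # failed before the first chunk: try the next vessel
        raise RuntimeError("no vessel available")


async def collect(orch: FallbackOrchestrator, messages: list[dict]) -> list[Chunk]:
    return [chunk async for chunk in orch.stream(messages)]
```

Falling back only before the first chunk is a design choice in this sketch: once part of an answer has reached the user, switching vessels would stitch two different answers into one stream.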

What you have built once this works

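A privacy-aware, cost-aware, outage-aware service. Cloud goes down → users keep getting answers. Cloud rate-limits you → users keep getting answers. Cloud gets expensive → bulk batch jobs are routed to local. The orchestrator is simple and the adapters are narrow, which is what makes it robust.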
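
One way to picture the cost and privacy routing described above is a small function that decides which vessels to try, and in what order. This is a sketch under assumptions: RoutingPolicy, vessel_order, and the bulk and private flags are hypothetical names, and "private requests never leave the machine" is one possible reading of privacy-aware, not something the lesson specifies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RoutingPolicy:
    cloud_name: str = "cloud"
    local_name: str = "ollama"
    bulk_to_local: bool = True   # turn on when cloud cost matters more than quality


def vessel_order(policy: RoutingPolicy, *, bulk: bool, private: bool) -> list[str]:
    """Return vessel names in the order the orchestrator should try them."""
    if private:
        # Assumption: private requests stay local, with no cloud fallback.
        return [policy.local_name]
    if bulk and policy.bulk_to_local:
        # Bulk batch jobs go local first; the cloud is only the backup.
        return [policy.local_name, policy.cloud_name]
    # Interactive traffic: cloud first, local as the outage fallback.
    return [policy.cloud_name, policy.local_name]
```

Keeping the decision in a pure function like this leaves the orchestrator itself unchanged: it still just walks a list of vessels in order.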

Code

End-to-end skeleton (FastAPI + orchestrator)·python

from fastapi import FastAPI
from fastapi.responses import StreamingResponse
import json

app = FastAPI()

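# Build the orchestrator once at import time; the startup hook pre-warms the local fallback.
# AIOrchestrator is the class from the earlier orchestrator lesson; import it from wherever you defined it.
ORCHESTRATOR = AIOrchestrator()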

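@app.on_event("startup")
async def warm_local():
    # Pre-load the local fallback model so the first cold fallback is not a 30-second wait.
    # A generate request with no prompt only loads the model; a numeric keep_alive of -1
    # keeps it in memory (a string value would need a unit, e.g. "-1m").
    try:
        await ORCHESTRATOR.vessels["ollama"].client.post(
            "http://localhost:11434/api/generate",
            json={"model": "qwen2.5:7b", "keep_alive": -1},
            timeout=120.0,
        )
    except Exception:
        pass  # fine if Ollama is not installed or not running locally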

@app.post("/chat/stream")
async def chat_stream(req: dict):
    """Stream the answer to the user, with cloud→local fallback."""
    chosen_vessel = {"name": None}

    async def gen():
        try:
            async for chunk in ORCHESTRATOR.stream(req["messages"]):
                if chosen_vessel["name"] is None:
                    # The first chunk tells us which vessel ended up answering
                    chosen_vessel["name"] = chunk.model
                payload = {"content": chunk.content, "done": chunk.done}
                yield f"data: {json.dumps(payload)}\n\n"
        finally:
            # Log which vessel answered (for ops), also when the stream ends early
            print(f"[/chat/stream] vessel={chosen_vessel['name']}")

    return StreamingResponse(gen(), media_type="text/event-stream")

# Run: uvicorn main:app --host 0.0.0.0 --port 9000
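
To check the endpoint from the other side, here is a rough sketch of how a client could read the stream. parse_sse and collect_answer are hypothetical helpers, and the parser is deliberately simplified: it assumes every event is a single data: line carrying a JSON object with content and done keys, which matches what the skeleton above emits but is not a full SSE parser.

```python
import json
from typing import Iterable, Iterator


def parse_sse(lines: Iterable[str]) -> Iterator[dict]:
    """Yield the JSON payload of each `data:` line in the stream."""
    for line in lines:
        line = line.rstrip("\r\n")
        if not line.startswith("data:"):
            continue  # blank event separators, comments, other SSE fields
        yield json.loads(line[len("data:"):].strip())


def collect_answer(lines: Iterable[str]) -> str:
    """Join streamed content until a payload marks the answer as done."""
    parts = []
    for payload in parse_sse(lines):
        parts.append(payload["content"])
        if payload["done"]:
            break
    return "".join(parts)
```

Any iterable of text lines works as input, for example the lines of a streamed HTTP response body. Because the payload carries no vessel name, the client cannot tell whether the cloud or Ollama answered, which is the point of the transparent fallback.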
