Cost-Optimized Routing

The single biggest lever

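Pro costs roughly 22.5x more than Flash-Lite per output token, and most of your traffic does not need Pro. A smart router that sends routine work to Flash-Lite, default work to Flash, and only the requests that truly need it to Pro typically cuts a production bill by 70–90%.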

Routing inputs

The decision is based on the following:

Reasoning needed? Multi-step logic, code, math → Pro.
Context size? > 200K tokens → Pro (only Pro handles long context well; Flash and Flash-Lite take a quality hit).
Tool use? Complex multi-tool agentic loops → Pro. Simple single-tool calls → Flash.
Latency? Sub-500ms requirement → Flash-Lite (the smallest model, with the fastest time to first token).
Volume? High-volume simple tasks → Flash-Lite.
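
The inputs above have to be derived from each request before a router can use them. One possible way to do that is sketched here. The `RoutingSignals` dataclass, the `estimate_tokens` helper, and the 4-characters-per-token heuristic are all assumptions made for illustration; none of them come from an SDK, and a real tokenizer would give better context estimates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RoutingSignals:
    """Hypothetical container for the routing inputs listed above."""
    needs_reasoning: bool
    needs_tools: bool
    context_tokens: int
    latency_budget_ms: int


def estimate_tokens(text: str) -> int:
    # Rough heuristic (assumption): about 4 characters per token.
    # Swap in a real token counter when accuracy matters.
    return len(text) // 4


def extract_signals(
    prompt: str,
    tool_names: list[str],
    latency_budget_ms: int = 10_000,
    reasoning_hint: bool = False,
) -> RoutingSignals:
    # reasoning_hint is supplied by the caller; this sketch does not
    # try to detect reasoning needs from the prompt text itself.
    return RoutingSignals(
        needs_reasoning=reasoning_hint,
        needs_tools=len(tool_names) > 0,
        context_tokens=estimate_tokens(prompt),
        latency_budget_ms=latency_budget_ms,
    )


signals = extract_signals('x' * 400, tool_names=['search'])
print(signals.context_tokens, signals.needs_tools)  # 100 True
```

Keeping the signals in one immutable object makes it easy to log exactly what the router saw for each request.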

Real numbers

Strategy	Relative cost	When to use
Always Pro	1.0x	Maximum quality, no cost ceiling
Smart routing	~0.3x	Most production apps
Flash-Lite only	~0.04x	High-volume simple tasks
Caching enabled	~0.1x per cached call	Repeated context (PDF Q&A)
Batch API (offline)	0.5x	Async pipelines
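
A figure like ~0.3x for smart routing is a traffic-weighted average of per-model relative costs. The sketch below shows the arithmetic under stated assumptions: the traffic mix (20% Pro, 30% Flash, 50% Flash-Lite) and the relative cost of Flash (0.25x of Pro) are hypothetical values chosen for illustration, and only the Pro to Flash-Lite ratio of ~22.5x comes from this lesson. Substitute your own measured mix and current prices.

```python
def blended_relative_cost(
    traffic_share: dict[str, float],
    relative_cost: dict[str, float],
) -> float:
    """Traffic-weighted average cost, relative to sending everything to Pro."""
    total = sum(traffic_share.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError(f'traffic shares must sum to 1.0, got {total}')
    return sum(share * relative_cost[model] for model, share in traffic_share.items())


# Cost per request relative to Pro. The Flash value is an assumption.
relative_cost = {'pro': 1.0, 'flash': 0.25, 'flash-lite': 1 / 22.5}

# Hypothetical share of requests routed to each model.
traffic_share = {'pro': 0.20, 'flash': 0.30, 'flash-lite': 0.50}

cost = blended_relative_cost(traffic_share, relative_cost)
print(f'{cost:.2f}x')  # 0.30x
```

Note how the Pro share dominates the result: with this mix, Pro accounts for 0.20 of the 0.30 total, so shrinking the fraction of requests that reach Pro matters more than anything else.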

Code

Smart router: minimal version·python

class SmartRouter:
    """Pick the cheapest capable Gemini model for a request."""

    def route(
        self,
        needs_tools: bool = False,
        needs_reasoning: bool = False,
        max_context: int = 0,
        latency_budget_ms: int = 10_000,
    ) -> str:
        # Tight latency budget — go straight to Flash-Lite
        if latency_budget_ms < 500:
            return 'gemini-2.5-flash-lite'

        # Reasoning or huge context — Pro is the only safe choice
        if needs_reasoning or max_context > 200_000:
            return 'gemini-2.5-pro'

        # Tool use or a mid-sized context (> 50K tokens) without heavy reasoning: Flash handles these fine
        if needs_tools or max_context > 50_000:
            return 'gemini-2.5-flash'

        # Simple high-volume — cheapest model
        return 'gemini-2.5-flash-lite'

router = SmartRouter()
model = router.route(
    needs_tools=True,
    needs_reasoning=False,
    max_context=80_000,
)
print(model)  # 'gemini-2.5-flash'

Inspect-then-pick: the prompt classifies itself·python

from google import genai

# Assumes the google-genai SDK is installed and an API key is set
# in the environment (e.g. GEMINI_API_KEY).
client = genai.Client()

# Bucket name → model. Anything unrecognized falls back to Flash.
MODEL_BY_BUCKET = {
    'CHAT':   'gemini-2.5-flash-lite',
    'CODE':   'gemini-2.5-flash',
    'REASON': 'gemini-2.5-pro',
    'AGENT':  'gemini-2.5-pro',
}

async def route_by_classification(prompt: str) -> str:
    """Use Flash-Lite to decide which model the real prompt deserves."""
    classifier = await client.aio.models.generate_content(
        model='gemini-2.5-flash-lite',
        contents=(
            'Classify the following user request into exactly one bucket: '
            'CHAT (simple Q&A), CODE (writing or debugging code), '
            'REASON (multi-step logic), or AGENT (multi-tool workflow). '
            f'Output only the bucket name.\n\nRequest: {prompt}'
        ),
        config={'max_output_tokens': 10, 'temperature': 0.0},
    )
    # .text can be None when the response comes back empty
    bucket = (classifier.text or '').strip().upper()
    return MODEL_BY_BUCKET.get(bucket, 'gemini-2.5-flash')

Exercise

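Build the smart router from the first code block. Wire it into the GeminiAdapter from the previous lesson so that the model is chosen per call instead of being fixed at construction time. Run 100 mixed prompts (some chat, some reasoning, some long-context) and log which model was selected for each. Calculate the cost against the always-Pro and always-Flash-Lite baselines.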
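
As a possible starting scaffold for the cost comparison, the sketch below tallies model picks and compares the routed cost against both baselines. Everything in it is an assumption: `CallRecord` and `summarize` are hypothetical names, the prices are relative units (Flash-Lite = 1.0, Pro = 22.5 from the ratio in this lesson, Flash an illustrative guess) rather than actual pricing, only output tokens are counted, and the baselines assume each model would have produced the same number of output tokens.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class CallRecord:
    """Hypothetical log entry: which model ran and how many tokens it emitted."""
    model: str
    output_tokens: int


def summarize(records: list[CallRecord], price_per_token: dict[str, float]) -> dict:
    picks = Counter(r.model for r in records)
    total_tokens = sum(r.output_tokens for r in records)
    routed = sum(r.output_tokens * price_per_token[r.model] for r in records)
    return {
        'picks': dict(picks),
        'routed': routed,
        # Baselines: same token volume, but every call priced at one model.
        'always_pro': total_tokens * price_per_token['gemini-2.5-pro'],
        'always_flash_lite': total_tokens * price_per_token['gemini-2.5-flash-lite'],
    }


# Relative price units, not real pricing. The Flash value is a guess.
RELATIVE_PRICE = {
    'gemini-2.5-pro': 22.5,
    'gemini-2.5-flash': 5.625,
    'gemini-2.5-flash-lite': 1.0,
}

records = [
    CallRecord('gemini-2.5-flash-lite', 200),
    CallRecord('gemini-2.5-flash', 500),
    CallRecord('gemini-2.5-pro', 1000),
]
report = summarize(records, RELATIVE_PRICE)
print(report['picks'])
print(report['routed'] / report['always_pro'])
```

In the real exercise, the records would come from the adapter's per-call logs, and input tokens should be added to the accounting for a fuller picture.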