C.W.K.
Stream
Lesson 03 of 04 · published

Rate Limits and Fallback

~14 min · rate-limits, fallback, retry, resilience

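A 429 is normal, not an exception

You will hit a 429. Not because you did something wrong, but because rate limits exist. The right response is exponential backoff with jitter, not panic. The wrong responses are a tight retry loop (which makes things worse) or surfacing the raw error directly to the user.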

Fallback chains across models

When Pro is rate-limited, Flash usually is not. When Flash is rate-limited, Flash-Lite usually is not. Build a chain that steps down the cost ladder whenever a call fails on a rate limit.

Fallback chains across providers

Cross-provider fallback (Gemini → OpenAI → Claude) is the next layer. It has the same shape: try, switch on failure, log, surface.

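A safety block is not something to retry

finishReason: SAFETY means the model filtered its output. Retrying the same prompt gives the same result. Instead, rephrase the prompt, loosen the safety threshold (only when the use case justifies it), or surface a graceful message to the user.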

Code

Resilient single call with backoff·python
import asyncio, random
from google.genai import errors

async def resilient_generate(client, model, contents, max_retries=4):
    for attempt in range(max_retries):
        try:
            return await client.aio.models.generate_content(
                model=model, contents=contents,
            )
        except errors.ClientError as e:
            if e.code == 429:
                # Rate limit — exponential backoff with jitter
                base = 2 ** attempt
                sleep = base + random.uniform(0, base * 0.5)
                print(f'[429 on {model}] sleeping {sleep:.1f}s')
                await asyncio.sleep(sleep)
                continue
            # 4xx other than 429 — don't retry
            raise
        except errors.ServerError:
            await asyncio.sleep(min(2 ** attempt, 30))
            continue
    raise RuntimeError(f'Gave up after {max_retries} retries')
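
To get a feel for the delays the 429 branch above produces, the formula can be pulled out into a small pure function. This is only a sketch: `backoff_delay` and its `cap` and `rng` parameters are hypothetical names, not part of the google-genai SDK, and capping the 429 delay is an added assumption (the code above caps only the 5xx branch).

```python
import random
from typing import Optional

def backoff_delay(attempt: int, cap: float = 30.0,
                  rng: Optional[random.Random] = None) -> float:
    """Seconds to wait before retry number `attempt` (0-based).

    Same shape as the 429 branch above: base = 2 ** attempt, plus up to
    50% random jitter. The cap is an assumption added for this sketch.
    """
    source = rng if rng is not None else random
    base = min(2 ** attempt, cap)
    return base + source.uniform(0, base * 0.5)

if __name__ == "__main__":
    # Print one possible schedule for the default of four attempts.
    for attempt in range(4):
        print(f"attempt {attempt}: sleep {backoff_delay(attempt):.1f}s")
```

The jitter matters when many clients are throttled at the same moment: without it they would all wake up and retry in lockstep.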
Multi-model fallback chain·python
MODEL_CHAIN = [
    'gemini-2.5-pro',
    'gemini-2.5-flash',
    'gemini-2.5-flash-lite',
]

async def generate_with_fallback(client, contents):
    last_error = None
    for model in MODEL_CHAIN:
        try:
            return await client.aio.models.generate_content(
                model=model, contents=contents,
            ), model
        except errors.ClientError as e:
            if e.code == 429:
                last_error = e
                continue  # Try next model
            raise  # Other 4xx — don't shop around
        except errors.ServerError as e:
            last_error = e
            continue  # 5xx — try next

    raise RuntimeError(f'All models exhausted. Last error: {last_error}')
Handling SAFETY explicitly·python
import logging

log = logging.getLogger(__name__)

def text_or_safety_message(response):
    candidate = response.candidates[0]
    if candidate.finish_reason.name == 'SAFETY':
        # Log which categories triggered the block
        for rating in candidate.safety_ratings or []:
            if rating.probability.name in ('HIGH', 'MEDIUM'):
                log.warning(f'Safety block: {rating.category.name} -> {rating.probability.name}')

        # Don't retry. Surface gracefully.
        return "Sorry, I can't help with that request."
    return response.text
Cross-provider fallback (sketch)·python
ADAPTER_CHAIN: list[ModelAdapter] = [
    GeminiAdapter(api_key=GEMINI_KEY),
    OpenAIAdapter(api_key=OPENAI_KEY),     # if you have one
    OllamaAdapter(model='llama3.1:70b'),   # local fallback
]

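async def stream_with_provider_fallback(messages):
    for adapter in ADAPTER_CHAIN:
        emitted = False
        try:
            async for chunk in adapter.generate_stream(messages):
                emitted = True
                yield chunk
            return
        except Exception as e:
            if emitted:
                # Partial output already reached the caller; restarting on
                # another provider would duplicate it, so re-raise instead.
                raise
            # Report it visibly, never fail silently
            print(f'[{adapter.name()} failed: {e}. Trying next provider.]')
            continue
    raise RuntimeError('All providers exhausted')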
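
The sketch above relies on a `ModelAdapter` interface that the lesson does not show. One minimal version, assuming each adapter exposes `name()` and an async-generator `generate_stream(messages)`, might look like the following. `FlakyAdapter` and `EchoAdapter` are hypothetical stand-ins for real providers so the example runs without API keys, and `stream_with_fallback` takes the adapter list as a parameter. It also refuses to switch providers once chunks have been emitted, a safeguard added here to avoid duplicated output.

```python
import asyncio
from typing import AsyncIterator, Protocol

class ModelAdapter(Protocol):
    """Hypothetical interface assumed by the fallback sketch."""
    def name(self) -> str: ...
    def generate_stream(self, messages: list) -> AsyncIterator[str]: ...

class FlakyAdapter:
    """Stand-in provider that always fails before yielding anything."""
    def name(self) -> str:
        return "flaky"

    async def generate_stream(self, messages: list) -> AsyncIterator[str]:
        raise ConnectionError("simulated outage")
        yield  # unreachable; its presence makes this an async generator

class EchoAdapter:
    """Stand-in provider that streams the last message back word by word."""
    def name(self) -> str:
        return "echo"

    async def generate_stream(self, messages: list) -> AsyncIterator[str]:
        for word in messages[-1].split():
            yield word

async def stream_with_fallback(adapters, messages):
    for adapter in adapters:
        emitted = False
        try:
            async for chunk in adapter.generate_stream(messages):
                emitted = True
                yield chunk
            return
        except Exception as e:
            if emitted:
                raise  # partial output already sent; switching would duplicate it
            print(f"[{adapter.name()} failed: {e}. Trying next provider.]")
    raise RuntimeError("All providers exhausted")

async def collect(adapters, messages):
    # Drain the stream into a list so the result is easy to inspect.
    return [chunk async for chunk in stream_with_fallback(adapters, messages)]

if __name__ == "__main__":
    chunks = asyncio.run(collect([FlakyAdapter(), EchoAdapter()], ["hello from fallback"]))
    print(chunks)
```

Keeping the interface this small is what makes the chain cheap to extend: a real Gemini, OpenAI, or Ollama adapter only has to translate its own SDK's streaming call into plain text chunks.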

Exercise

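Build the multi-model fallback chain from the second code block. Force a rate limit (for example, call Pro 100 times in a tight loop from a fresh project with no accumulated quota). Confirm that the chain falls through to Flash and then to Flash-Lite, with appropriate logging. Bonus: add a per-model cooldown timer, so that once a model has been rate-limited it is skipped for 60 seconds instead of being tried again.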
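
For the bonus, one possible shape for the cooldown bookkeeping is sketched below. `ModelCooldown` and its method names are hypothetical, and the injectable `clock` parameter is an assumption added so the timer can be tested without actually waiting 60 seconds.

```python
import time

class ModelCooldown:
    """Tracks which models were recently rate-limited (hypothetical helper)."""

    def __init__(self, seconds: float = 60.0, clock=time.monotonic):
        self.seconds = seconds
        self._clock = clock
        self._blocked_until = {}  # model name -> time when it may be retried

    def mark_rate_limited(self, model: str) -> None:
        """Call this from the 429 branch of the fallback loop."""
        self._blocked_until[model] = self._clock() + self.seconds

    def is_available(self, model: str) -> bool:
        """True if the model was never blocked or its cooldown has expired."""
        return self._clock() >= self._blocked_until.get(model, float("-inf"))

    def available(self, chain: list) -> list:
        """The models in `chain` that are not cooling down, in order."""
        return [m for m in chain if self.is_available(m)]
```

Inside `generate_with_fallback`, the loop would then iterate over `cooldown.available(MODEL_CHAIN)` rather than `MODEL_CHAIN`, and call `cooldown.mark_rate_limited(model)` before moving on to the next model.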
