FastAPI Streaming Proxy

Why proxy

Production apps usually want a proxy in front of Gemini for two reasons:

Hide the API key. The browser never sees the credential. The blast radius of a compromised browser shrinks from "call Gemini directly with your key" to "call your proxy".
Add business logic. Auth, rate limiting, request validation, response logging, and model selection all live in one place on the server.
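
As one illustration of the server-side logic a proxy can host, here is a minimal sketch of a per-client rate limiter using a token bucket. The class name TokenBucket, its parameters, and the idea of keying on a client id string are assumptions for illustration; the original does not prescribe any particular limiter, and this one keeps state in memory for a single process only.

```python
import time


class TokenBucket:
    """Per-client token bucket: refills `rate` tokens per second, up to `capacity`.

    Hypothetical helper for illustration only.
    """

    def __init__(self, rate: float, capacity: int, now=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.now = now  # injectable clock, which makes the limiter easy to test
        self.buckets = {}  # client_id -> (tokens_left, last_seen_timestamp)

    def allow(self, client_id: str) -> bool:
        t = self.now()
        tokens, last = self.buckets.get(client_id, (float(self.capacity), t))
        # Refill in proportion to the elapsed time, capped at capacity.
        tokens = min(self.capacity, tokens + (t - last) * self.rate)
        if tokens < 1:
            self.buckets[client_id] = (tokens, t)
            return False
        # Spend one token for this request.
        self.buckets[client_id] = (tokens - 1, t)
        return True
```

In a proxy handler, one plausible use is to call allow() with the caller's user id or IP before opening the upstream stream, and return HTTP 429 when it returns False.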

Pattern

FastAPI's StreamingResponse plus httpx.AsyncClient.stream(). Open a stream to Gemini, forward each chunk to the client, and close when the upstream finishes. In total, a working proxy is about 30 lines.

SSE vs raw passthrough

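Two designs:

Raw passthrough: forward Gemini's bytes unchanged. This is the cheapest option, but the client has to know Gemini's exact wire format.
Re-emit in your own SSE format: parse each chunk, extract the text, and emit your own data: {"text": "..."} events. More work, but it decouples the client from Gemini's evolving schema.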
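
The second design comes down to two small steps: pull the text out of an upstream event, then frame it as your own event. The sketch below shows both with the standard library only. The function names extract_text and to_sse are hypothetical, and the upstream shape (candidates, content, parts, text) is an assumption about the JSON Gemini sends on its SSE stream, so verify it against a real response before relying on it.

```python
import json


def extract_text(upstream_event: str) -> str:
    """Pull the text out of one upstream `data: {...}` SSE event.

    Assumes the payload looks like
    {"candidates": [{"content": {"parts": [{"text": "..."}]}}]}.
    """
    data_lines = [
        line[len("data:"):].lstrip()
        for line in upstream_event.splitlines()
        if line.startswith("data:")
    ]
    if not data_lines:
        return ""  # comment or keep-alive line, nothing to forward
    body = json.loads("\n".join(data_lines))
    texts = []
    # Only the first candidate is read in this sketch.
    for candidate in body.get("candidates", [])[:1]:
        for part in candidate.get("content", {}).get("parts", []):
            texts.append(part.get("text", ""))
    return "".join(texts)


def to_sse(text: str) -> str:
    """Frame text as one event in the proxy's own format."""
    return f"data: {json.dumps({'text': text})}\n\n"
```

The client now depends only on the {"text": ...} shape, so a change in the upstream schema is absorbed inside extract_text.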

Code

Raw passthrough proxy·python

import httpx
from fastapi import FastAPI, Request
from fastapi.responses import StreamingResponse

app = FastAPI()
GEMINI_API_KEY = '...'  # from env in real code
BASE = 'https://generativelanguage.googleapis.com/v1beta'

@app.post('/v1beta/models/{model}:streamGenerateContent')
async def proxy_stream(model: str, request: Request):
    body = await request.body()
    url = f'{BASE}/models/{model}:streamGenerateContent?alt=sse'

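    async def stream_gen():
        # Open the client inside the generator so the connection stays alive
        # until the last chunk is sent, after the handler itself has returned.
        async with httpx.AsyncClient(timeout=120) as client:
            async with client.stream(
                'POST', url,
                headers={
                    'x-goog-api-key': GEMINI_API_KEY,
                    'Content-Type': 'application/json',
                },
                content=body,
            ) as response:
                # Note: the upstream status code is not forwarded; an error
                # body from Gemini is still relayed under a 200 response.
                async for chunk in response.aiter_bytes():
                    yield chunk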

    return StreamingResponse(
        stream_gen(),
        media_type='text/event-stream',
    )
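
The key above is marked "from env in real code". A minimal sketch of that step, assuming the variable is named GEMINI_API_KEY (the name the SDK example below also mentions) and that failing at startup is preferable to failing on the first request. The helper name load_api_key is hypothetical.

```python
import os


def load_api_key(var: str = "GEMINI_API_KEY") -> str:
    """Read the API key from the environment, failing loudly if it is missing."""
    key = os.environ.get(var, "").strip()
    if not key:
        raise RuntimeError(f"{var} is not set; refusing to start the proxy")
    return key
```

With this in place, the module-level assignment would become GEMINI_API_KEY = load_api_key().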

Re-emit in your own SSE protocol·python

import json
from google import genai
from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()
client = genai.Client()  # uses env GEMINI_API_KEY

@app.post('/api/chat')
async def chat_stream(req: dict):
    prompt = req['prompt']

    async def stream_gen():
        try:
            async for chunk in await client.aio.models.generate_content_stream(
                model='gemini-2.5-flash', contents=prompt,
            ):
                # One SSE event per text delta.
                if chunk.text:
                    payload = json.dumps({'type': 'text', 'data': chunk.text})
                    yield f'data: {payload}\n\n'
                # Token counts, forwarded whenever a chunk carries them.
                if chunk.usage_metadata:
                    usage = {
                        'prompt': chunk.usage_metadata.prompt_token_count,
                        'completion': chunk.usage_metadata.candidates_token_count,
                    }
                    payload = json.dumps({'type': 'usage', 'data': usage})
                    yield f'data: {payload}\n\n'
            # Explicit end-of-stream marker so the client knows to stop.
            yield 'data: {"type": "done"}\n\n'
        except Exception as e:
            # The response has already started, so report failures in-band.
            err = json.dumps({'type': 'error', 'data': str(e)})
            yield f'data: {err}\n\n'

    return StreamingResponse(stream_gen(), media_type='text/event-stream')
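
On the consuming side, network chunks do not line up with event boundaries, so a client has to buffer until it sees the blank line that ends an event. A minimal sketch of that parsing step, assuming the data: {json} framing emitted above; the function name iter_sse_messages is hypothetical, and the input can be any iterable of byte chunks (for example, what an HTTP client yields while streaming).

```python
import json
from typing import Iterable, Iterator


def iter_sse_messages(chunks: Iterable[bytes]) -> Iterator[dict]:
    """Yield decoded JSON payloads from a stream of raw SSE bytes."""
    buffer = b""
    for chunk in chunks:
        buffer += chunk
        # A blank line terminates an event; anything after it stays buffered.
        while b"\n\n" in buffer:
            raw_event, buffer = buffer.split(b"\n\n", 1)
            for line in raw_event.decode("utf-8").splitlines():
                if line.startswith("data:"):
                    yield json.loads(line[len("data:"):].strip())
```

Splitting on bytes before decoding means a multi-byte UTF-8 character cut across two chunks is never decoded in halves.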

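Browser side: fetch-based SSE consumer·typescript

// EventSource can only issue GET requests, so it cannot reach the POST /api/chat
// route above. Read the event stream from fetch() instead.
const res = await fetch('/api/chat', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({ prompt }),
});
const reader = res.body!.pipeThrough(new TextDecoderStream()).getReader();
let buffer = '';

for (;;) {
  const { value, done } = await reader.read();
  if (done) break;
  buffer += value;
  // Events end with a blank line; keep any trailing partial event buffered.
  const events = buffer.split('\n\n');
  buffer = events.pop() ?? '';
  for (const event of events) {
    if (!event.startsWith('data: ')) continue;
    const msg = JSON.parse(event.slice('data: '.length));
    if (msg.type === 'text') appendToReply(msg.data);
    else if (msg.type === 'usage') showUsage(msg.data);
    else if (msg.type === 'error') showError(msg.data);
    // 'done' needs no action: the server closes the stream right after it.
  }
}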

Exercise

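Write the re-emit proxy from the second code block. Run it on localhost:8000, then write a small static HTML page with a textarea and a button that POSTs to it and renders the streamed answer token by token. Add an artificial 50 ms delay per chunk in the proxy and check that the UI updates smoothly, confirming that chunks arrive in real time rather than in one batch.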
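
One possible way to add the artificial delay without touching the proxy's own generator is to wrap it. This is a sketch, not the only approach; the name with_delay is hypothetical, and it relies on asyncio.sleep so the event loop stays free to serve other requests while a chunk is held back.

```python
import asyncio
from typing import AsyncIterator


async def with_delay(
    source: AsyncIterator[str], seconds: float = 0.05
) -> AsyncIterator[str]:
    """Re-yield every item from `source`, pausing `seconds` before each one."""
    async for item in source:
        await asyncio.sleep(seconds)  # non-blocking pause
        yield item
```

In the proxy, that would mean passing with_delay(stream_gen()) to StreamingResponse instead of stream_gen().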
