Why streaming, and the SSE format

What streaming buys you is time-to-first-token

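Gemini 2.5 Flash typically has a time-to-first-token (TTFT) of 200–400 ms. Time-to-last-token for a 500-word answer is 3–8 seconds. Streaming does not make the model faster; it turns the user's wait from a blank screen into visible progress.

The product cost of skipping streaming is the gap between "this app is slow" and "this app is alive." Same total time, different perception.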
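
To make the two numbers concrete, here is a minimal sketch that times any iterable of chunks. The name measure_stream and the injectable clock parameter are assumptions for illustration, not part of any Gemini SDK, and "time to first chunk" is only an approximation of TTFT as seen by the client.

```python
import time
from typing import Callable, Iterable, Tuple


def measure_stream(
    chunks: Iterable[object],
    clock: Callable[[], float] = time.monotonic,
) -> Tuple[float, float]:
    """Return (time_to_first_chunk, time_to_last_chunk) in seconds."""
    start = clock()
    first = None
    last = start
    for _ in chunks:
        last = clock()  # timestamp of the chunk that just arrived
        if first is None:
            first = last  # remember only the very first arrival
    if first is None:
        raise ValueError("stream produced no chunks")
    return first - start, last - start


# Deterministic demo: a fake clock stands in for real network timing.
ticks = iter([0.0, 0.3, 2.0, 5.0])
ttft, ttlt = measure_stream(["Hello", " world", "!"], clock=lambda: next(ticks))
print(ttft, ttlt)  # 0.3 5.0
```

With a real stream you would pass the chunk iterator and keep the default clock; the total time stays the same whether or not you stream, only the first number changes what the user sees.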

Server-Sent Events — wire format

Gemini's streaming endpoint uses Server-Sent Events (SSE). Each event is a line starting with data: followed by a complete JSON object. It has the same shape as a non-streaming response, except the parts array holds only a slice of the text.
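
The format described above can be parsed in a few lines. This is a minimal sketch, assuming every event is a single data: line carrying complete JSON; the function name iter_sse_chunks is hypothetical, and multi-line data: fields (which the SSE spec allows) are deliberately not handled.

```python
import json
from typing import Iterable, Iterator


def iter_sse_chunks(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one parsed JSON object per single-line SSE `data:` event."""
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.startswith("data:"):
            continue  # skip blank separators, comments, and other SSE fields
        payload = line[len("data:"):].strip()
        if payload:
            yield json.loads(payload)


# Sample lines shaped like the response body described in this section.
sample = [
    'data: {"candidates":[{"content":{"parts":[{"text":"Hello"}],"role":"model"}}]}\n',
    "\n",
    'data: {"candidates":[{"content":{"parts":[{"text":" world"}],"role":"model"}}]}\n',
    "\n",
]
texts = [
    chunk["candidates"][0]["content"]["parts"][0]["text"]
    for chunk in iter_sse_chunks(sample)
]
print(texts)  # ['Hello', ' world']
```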

Two endpoint variants

Same URL, different query string:

/streamGenerateContent (without ?alt=sse): returns a buffered JSON array. Use it when you want the response as streaming chunks but don't want to parse SSE.
/streamGenerateContent?alt=sse: returns SSE. Use it when forwarding the stream to a browser or parsing line by line.
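
The difference between the two variants can be sketched as one URL builder plus a parser for the array form. Both helper names (stream_url, chunks_from_array_body) are hypothetical, and the sketch assumes the non-SSE body has been read in full before parsing.

```python
import json

BASE = "https://generativelanguage.googleapis.com/v1beta/models"


def stream_url(model: str, sse: bool) -> str:
    """Build the streaming endpoint URL; sse=True appends ?alt=sse."""
    url = f"{BASE}/{model}:streamGenerateContent"
    return (url + "?alt=sse") if sse else url


def chunks_from_array_body(body: str) -> list:
    """Parse the non-SSE variant: the whole body is one JSON array of chunks."""
    chunks = json.loads(body)
    if not isinstance(chunks, list):
        raise ValueError("expected a JSON array of chunks")
    return chunks


print(stream_url("gemini-2.5-flash", sse=True))
body = '[{"candidates":[{"content":{"parts":[{"text":"Hi"}]}}]}]'
print(len(chunks_from_array_body(body)))  # 1
```

The array form trades incremental delivery for a simpler parse: one json.loads call on the full body instead of a line-by-line loop.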

The final chunk carries usageMetadata

Token counts appear only on the last chunk of the stream. To bill or log a call, accumulate text across every chunk and capture the chunk that contains usageMetadata.
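
A minimal sketch of that accumulation, assuming the chunks are already parsed into dicts and that only the first candidate matters. The function name collect_stream is hypothetical; it keeps the most recent usageMetadata it sees, so it still works if a chunk other than the last one happens to include it.

```python
from typing import Iterable, Optional, Tuple


def collect_stream(chunks: Iterable[dict]) -> Tuple[str, Optional[dict]]:
    """Join text slices from the first candidate and capture usageMetadata."""
    pieces = []
    usage = None
    for chunk in chunks:
        for candidate in chunk.get("candidates", [])[:1]:
            for part in candidate.get("content", {}).get("parts", []):
                pieces.append(part.get("text", ""))
        if "usageMetadata" in chunk:
            usage = chunk["usageMetadata"]  # keep the latest one seen
    return "".join(pieces), usage


# Chunks shaped like the three events in the raw wire example.
chunks = [
    {"candidates": [{"content": {"parts": [{"text": "Hello"}], "role": "model"}}]},
    {"candidates": [{"content": {"parts": [{"text": " world"}], "role": "model"}}]},
    {
        "candidates": [{"content": {"parts": [{"text": "!"}], "role": "model"},
                        "finishReason": "STOP"}],
        "usageMetadata": {"promptTokenCount": 4, "candidatesTokenCount": 3,
                          "totalTokenCount": 7},
    },
]
text, usage = collect_stream(chunks)
print(text)                      # Hello world!
print(usage["totalTokenCount"])  # 7
```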

Code

Raw SSE wire: what comes off the socket·text

POST .../models/gemini-2.5-flash:streamGenerateContent?alt=sse
x-goog-api-key: $GEMINI_API_KEY
Content-Type: application/json

# Response headers:
Content-Type: text/event-stream

# Response body (each block separated by blank line):
data: {"candidates":[{"content":{"parts":[{"text":"Hello"}],"role":"model"}}]}

data: {"candidates":[{"content":{"parts":[{"text":" world"}],"role":"model"}}]}

data: {"candidates":[{"content":{"parts":[{"text":"!"}],"role":"model"},"finishReason":"STOP"}],"usageMetadata":{"promptTokenCount":4,"candidatesTokenCount":3,"totalTokenCount":7}}

Chunk shape: same envelope, partial text·json

{
  "candidates": [{
    "content": {
      "parts": [{"text": "hello"}],
      "role": "model"
    },
    "finishReason": "STOP"
  }],
  "usageMetadata": {
    "promptTokenCount": 10,
    "candidatesTokenCount": 5,
    "totalTokenCount": 15
  }
}

Stream directly with curl·bash

curl -N \
  "https://generativelanguage.googleapis.com/v1beta/models/gemini-2.5-flash:streamGenerateContent?alt=sse" \
  -H "x-goog-api-key: $GEMINI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"contents":[{"parts":[{"text":"Tell me a haiku about coffee."}]}]}'

# -N disables curl's output buffering so you actually see chunks as they arrive.

Exercise

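Run the curl command above with -N, then again without it. Observe how chunk arrival differs. Next, pipe the output through tee stream.log and confirm that the final event contains usageMetadata and the earlier ones do not. Write one sentence explaining why ?alt=sse plus -N is the canonical curl recipe for stream debugging.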