Async and Streaming

.aio surface

Every method on client.models has an async sibling on client.aio.models: same arguments, same return shape, with only an await added. The same trick applies to client.aio.chats, client.aio.files, and so on.
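As a minimal sketch of that mirroring, the helpers below take an already-constructed client as a parameter and make the same call through both surfaces. The names ask_sync, ask_async, ask_many and the MODEL constant are illustrative assumptions, not part of the SDK; only generate_content and the .aio hop come from the text above.

```python
import asyncio

MODEL = "gemini-2.5-flash"  # assumption: any model name the client accepts


def ask_sync(client, prompt):
    """Blocking call through client.models; returns the response text."""
    response = client.models.generate_content(model=MODEL, contents=prompt)
    return response.text


async def ask_async(client, prompt):
    """Same arguments and return shape; only the .aio hop and the await differ."""
    response = await client.aio.models.generate_content(model=MODEL, contents=prompt)
    return response.text


async def ask_many(client, prompts):
    """Fan several prompts out concurrently; results come back in input order."""
    return await asyncio.gather(*(ask_async(client, p) for p in prompts))
```

With a real client this would be called as asyncio.run(ask_many(genai.Client(), ['first prompt', 'second prompt'])). The concurrent fan-out is the main practical reason to reach for the async surface.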

Streaming changes the function name

To stream, the function name itself changes from generate_content to generate_content_stream. The non-streaming call returns a single response; the streaming call returns an iterator of chunks (sync) or an async iterator (async).
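To make the difference in return shape concrete, here is a small sketch of a wrapper that hides it: both branches end up returning one string. The function name generate_text and its stream flag are hypothetical; the client is passed in so the helper works with any object exposing the two methods named above.

```python
def generate_text(client, prompt, model="gemini-2.5-flash", stream=False):
    """Return the full answer text using either call style."""
    if not stream:
        # Non-streaming: a single response object with the whole text.
        response = client.models.generate_content(model=model, contents=prompt)
        return response.text or ""

    # Streaming: an iterator of chunks, each holding one slice of the text.
    pieces = []
    for chunk in client.models.generate_content_stream(model=model, contents=prompt):
        if chunk.text:  # some chunks may carry no text
            pieces.append(chunk.text)
    return "".join(pieces)
```

In real code the streaming branch would normally render each slice as it arrives instead of buffering; the buffering here only serves to show that the two shapes carry the same content.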

Why streaming matters

Flash's time to first token (TTFT) is typically 200–400 ms. The time to last token for a 500-word answer is 3–8 seconds. Streaming lets rendering begin at TTFT, so the response feels dramatically faster even though the total time is unchanged.
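One way to observe those two numbers yourself is to time the first and last text-bearing chunks of a stream. The sketch below is an illustrative helper, not an SDK feature: measure_stream and its clock parameter are assumed names, and it accepts any iterable of objects with a text attribute, such as the result of client.models.generate_content_stream(...).

```python
import time


def measure_stream(chunks, clock=time.perf_counter):
    """Consume a chunk stream and time the first and last text slices.

    Returns (text, ttft_seconds, ttlt_seconds). Both timings are None
    when no chunk carried any text.
    """
    start = clock()
    ttft = None
    ttlt = None
    pieces = []
    for chunk in chunks:
        text = getattr(chunk, "text", None)
        if not text:
            continue  # skip chunks without a text slice
        elapsed = clock() - start
        if ttft is None:
            ttft = elapsed  # first visible text
        ttlt = elapsed      # overwritten until the stream ends
        pieces.append(text)
    return "".join(pieces), ttft, ttlt
```

The clock starts when the helper is called, so the stream should be created immediately before (or inline in the call) for the TTFT figure to be meaningful. The injectable clock exists only to make the helper easy to test with fixed timestamps.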

Each chunk is a partial GenerateContentResponse

A chunk has the same shape as a full response, but chunk.text holds only a slice of the text. Concatenate the slices as they arrive. usage_metadata appears only on the final chunk.

Code

Sync streaming·python

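from google import genai

client = genai.Client()  # reads the API key from the environment

# Each chunk carries one slice of the answer; print slices as they arrive.
for chunk in client.models.generate_content_stream(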
    model='gemini-2.5-flash',
    contents='Tell me a 200-word story about a coffee scale.',
):
    if chunk.text:
        print(chunk.text, end='', flush=True)
print()

Async streaming (production pattern)·python

import asyncio
from google import genai

client = genai.Client()

async def stream_story():
    full = []
    final_usage = None
    async for chunk in await client.aio.models.generate_content_stream(
        model='gemini-2.5-flash',
        contents='Write a 300-word story.',
    ):
        if chunk.text:
            print(chunk.text, end='', flush=True)
            full.append(chunk.text)
        if chunk.usage_metadata:  # populated only on the final chunk
            final_usage = chunk.usage_metadata
    print()
    if final_usage is not None:  # guard: the stream may end without usage data
        print(f'[total tokens: {final_usage.total_token_count}]')
    return ''.join(full)

asyncio.run(stream_story())

Streaming with multimodal input·python

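from google import genai
from google.genai import types

client = genai.Client()

# Assumes the client can read this Cloud Storage URI (e.g. a Vertex AI client).
for chunk in client.models.generate_content_stream(
    model='gemini-2.5-flash',
    contents=[
        "Describe what's in this image:",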
        types.Part.from_uri(
            file_uri='gs://my-bucket/photo.jpg',
            mime_type='image/jpeg',
        ),
    ],
):
    if chunk.text:
        print(chunk.text, end='', flush=True)

Exercise

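Build a small CLI that takes a prompt from argv, streams Flash's response to stdout as it arrives, and prints the total token count and wall-clock elapsed time at the end. Then run it twice with the same prompt and seed=0: the text should match, while the timing will differ. Record the latency gap between TTFT and time to last token (TTLT).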