Context Caching and the File API

Context caching: 90% discount

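If you are going to ask several questions of the same long context (a PDF, a codebase, a transcript), pay for the context once to cache it; after that, each question pays about 10% of the normal per-token price for the cached part. Actual Flash numbers: normal input $0.30/M tokens → cached input $0.03/M tokens.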
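
The arithmetic can be sketched as follows. This is a minimal model, not a billing calculator: it uses only the two Flash prices quoted above, assumes the context is billed once at the normal rate when the cache is created, and ignores output tokens, the question text, and cache storage. The 100,000-token document size is a hypothetical example.

```python
# Flash prices quoted above, in USD per million input tokens.
NORMAL_PER_M = 0.30
CACHED_PER_M = 0.03


def input_cost(context_tokens: int, questions: int, cached: bool) -> float:
    """Input-token cost of asking `questions` questions over one context.

    Simplified: ignores output tokens, the question text, and cache storage.
    """
    if not cached:
        # Every question re-sends the full context at the normal rate.
        return questions * context_tokens * NORMAL_PER_M / 1_000_000
    # Assumption: the context is billed once at the normal rate on cache
    # creation, then at the cached rate on every question.
    create = context_tokens * NORMAL_PER_M / 1_000_000
    reads = questions * context_tokens * CACHED_PER_M / 1_000_000
    return create + reads


if __name__ == "__main__":
    tokens = 100_000  # hypothetical document size
    for n in (1, 5, 20):
        print(n, input_cost(tokens, n, cached=False), input_cost(tokens, n, cached=True))
```

Under these assumptions the cached path is slightly more expensive for a single question and cheaper from the second question on.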

Minimum cacheable size

Flash: 1,024 tokens minimum.
Pro: 4,096 tokens minimum.

Below the minimum, caching is a no-op: you pay the full input price.
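
A small guard can encode these thresholds before you bother creating a cache. The sketch below is illustrative: `MIN_CACHE_TOKENS` and `can_cache` are hypothetical names, and the family keys are a simplification of real model IDs. In practice the token count would probably come from something like `client.models.count_tokens(...).total_tokens`, which is not called here so the example stays offline.

```python
# Minimum cacheable sizes as listed above (hypothetical helper names).
MIN_CACHE_TOKENS = {"flash": 1024, "pro": 4096}


def can_cache(model_family: str, token_count: int) -> bool:
    """Return True if a context of `token_count` tokens meets the minimum."""
    try:
        minimum = MIN_CACHE_TOKENS[model_family]
    except KeyError:
        raise ValueError(f"unknown model family: {model_family!r}") from None
    return token_count >= minimum


if __name__ == "__main__":
    print(can_cache("flash", 2000))  # above the Flash minimum
    print(can_cache("pro", 2000))    # below the Pro minimum
```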

Configurable TTL

The default is 1 hour. Set it with ttl='300s' (or any duration string). Caches are billed per token-hour, so a long TTL on a large context is a cost in its own right. Match the TTL to your actual reuse window.
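
To see how TTL turns into money, here is a rough sketch of the token-hour arithmetic. The storage rate is not given in this lesson, so `usd_per_m_token_hour` is a parameter you must fill in from the current pricing page; the value 1.00 used in the demo is a placeholder assumption, not a quoted price.

```python
def storage_cost(cached_tokens: int, ttl_seconds: float,
                 usd_per_m_token_hour: float) -> float:
    """Storage cost of keeping `cached_tokens` cached for the whole TTL.

    Assumes billing is linear in tokens and in time (token-hours).
    """
    hours = ttl_seconds / 3600
    return cached_tokens / 1_000_000 * hours * usd_per_m_token_hour


if __name__ == "__main__":
    rate = 1.00  # placeholder rate in USD per million tokens per hour
    # Same 200K-token context, 5-minute TTL versus 24-hour TTL.
    print(storage_cost(200_000, 300, rate))
    print(storage_cost(200_000, 24 * 3600, rate))
```

The 24-hour cache costs 288 times as much to hold as the 5-minute one, which is why the TTL should track how long you will actually keep asking questions.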

File API for files that outlive a single call

This is the same File API from the multimodal lesson. Files persist for 48 hours and can be attached to multiple cache or generation calls during that window.
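
If you track upload times yourself, a tiny helper can tell you how much of the 48-hour window is left before a re-upload is needed. This is a sketch with hypothetical names (`FILE_LIFETIME`, `time_left`); it relies only on the 48-hour figure above and does not call the API.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Retention window stated above.
FILE_LIFETIME = timedelta(hours=48)


def time_left(uploaded_at: datetime, now: Optional[datetime] = None) -> timedelta:
    """Remaining lifetime of an uploaded file, never negative."""
    if now is None:
        now = datetime.now(timezone.utc)
    remaining = uploaded_at + FILE_LIFETIME - now
    return max(remaining, timedelta(0))


if __name__ == "__main__":
    uploaded_at = datetime.now(timezone.utc) - timedelta(hours=40)
    print(time_left(uploaded_at))  # roughly 8 hours remain
```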

Code

PDF cache, multiple questions·python

from google import genai
from google.genai import types

client = genai.Client()

# 1. Upload the doc
doc = client.files.upload(
    file='whitepaper.pdf',
    config=dict(mime_type='application/pdf'),
)

# 2. Create a cache
cache = client.caches.create(
    model='gemini-2.5-flash',
    config=types.CreateCachedContentConfig(
        system_instruction='You are a precise document analyst.',
        contents=[doc],
        ttl='300s',  # 5 minutes; raise for longer reuse windows
    ),
)

# 3. Ask many questions, each cheap
for question in ['Summarize.', 'List the methods.', "What's the headline result?"]:
    response = client.models.generate_content(
        model='gemini-2.5-flash',
        contents=question,
        config=types.GenerateContentConfig(
            cached_content=cache.name,
        ),
    )
    print(question, '->', response.text[:120], '...')

# 4. Cleanup
client.caches.delete(name=cache.name)

File API: explicit upload + wait·python

import time

from google import genai
from google.genai import types

client = genai.Client()

uploaded = client.files.upload(
    file='video.mp4',
    config=types.UploadFileConfig(display_name='intro_video'),
)

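# Wait for processing if needed (large media is processed asynchronously)
while uploaded.state.name == 'PROCESSING':
    time.sleep(2.5)
    uploaded = client.files.get(name=uploaded.name)

# Stop early if processing failed instead of sending an unusable file
if uploaded.state.name == 'FAILED':
    raise RuntimeError(f'File processing failed: {uploaded.name}')

# File is now usable for 48h
# Reuse across multiple calls; the upload is the expensive step
for question in ['Summarize the video.', "What's the title slide say?"]: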
    response = client.models.generate_content(
        model='gemini-2.5-flash',
        contents=[uploaded, question],
    )
    print(response.text[:200])

client.files.delete(name=uploaded.name)

When not to cache·python

# Tiny context — cache overhead > benefit
context = 'Hello, world.'  # 4 tokens
# Don't cache. Just include in contents.

# One-shot question — never reused
context = open('huge_doc.txt').read()
questions = ['Summarize']  # only one
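# Caching costs at least as much as one normal call here. No win.

# Rule of thumb: reuse > 4, caching wins on Flash. Reuse > 6, wins on Pro.
# Math, ignoring storage: cache create cost ≈ 1.0× normal input.
# Per-question cached cost ≈ 0.1× normal input.
# N questions cost 1.0 + 0.1·N cached vs 1.0·N uncached,
# so break-even is at N ≈ 1 / (1 - 0.1) ≈ 1.1 questions.
# Storage is billed per token-hour over the TTL, which pushes the real
# break-even above this lower bound, toward the rule of thumb.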
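
The break-even math above can be made concrete with a short search that also accounts for storage. This is a sketch under stated assumptions: costs are expressed per million context tokens, cache creation is billed at the normal input rate, and the storage rate is a value you supply. The 1.00 per million token-hours used in the demo is a placeholder assumption, not a price quoted in this lesson.

```python
from typing import Optional


def break_even_questions(normal_per_m: float, cached_per_m: float,
                         storage_per_m_hour: float, ttl_hours: float,
                         max_n: int = 1000) -> Optional[int]:
    """Smallest number of questions for which caching is strictly cheaper.

    All prices are USD per million context tokens; returns None if caching
    never wins within `max_n` questions.
    """
    for n in range(1, max_n + 1):
        uncached = n * normal_per_m
        cached = (normal_per_m                      # cache creation (assumed 1.0x)
                  + n * cached_per_m                # cached reads
                  + storage_per_m_hour * ttl_hours) # holding the cache for the TTL
        if cached < uncached:
            return n
    return None


if __name__ == "__main__":
    # Flash prices from this lesson, no storage cost: the idealized bound.
    print(break_even_questions(0.30, 0.03, 0.0, 1.0))
    # Same prices with a placeholder storage rate over a 1-hour TTL.
    print(break_even_questions(0.30, 0.03, 1.0, 1.0))
```

Shortening the TTL shrinks the storage term, so the break-even count moves back toward the idealized bound.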

Exercise

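Take a long document (a textbook chapter, a long README, a research paper PDF; 5K tokens or more). Run two experiments: (a) ask 5 questions without caching and measure the time and price of each call; (b) cache the document, ask the same 5 questions, and measure time and price. Compare the totals. Check that caching is cheaper above the break-even point, and write down the actual break-even you measured.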
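
A possible starting point for the measurements is sketched below. The helper names (`timed`, `input_cost_usd`) are hypothetical. The token counts are meant to come from each response's usage metadata, assuming the SDK reports `prompt_token_count` (total prompt size, cached part included) and `cached_content_token_count` (which may be missing when no cache is used, so treat it as 0). Output-token cost is left out because this lesson does not quote an output price.

```python
import time


def timed(fn, *args, **kwargs):
    """Call `fn` and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def input_cost_usd(prompt_tokens: int, cached_tokens: int = 0,
                   normal_per_m: float = 0.30,
                   cached_per_m: float = 0.03) -> float:
    """Input cost of one call, using the Flash prices from this lesson.

    Assumes `prompt_tokens` already includes the cached tokens.
    """
    uncached_tokens = prompt_tokens - cached_tokens
    return (uncached_tokens * normal_per_m
            + cached_tokens * cached_per_m) / 1_000_000


if __name__ == "__main__":
    # Stand-in for a real API call so the sketch runs offline.
    result, seconds = timed(sum, [1, 2, 3])
    print(result, seconds)
    print(input_cost_usd(10_000))         # no caching
    print(input_cost_usd(10_000, 9_000))  # 9K of the 10K tokens served from cache
```

Wrap each `generate_content` call in `timed`, feed its usage counts into `input_cost_usd`, and sum both columns per experiment; remember to add the one-time cache creation and storage cost to experiment (b).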