Prompt caching for real dollar savings

What prompt caching does

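When you mark a prompt prefix as cacheable, Anthropic stores the processed model state for it. Later calls that send the same prefix read that state from the cache instead of processing the prefix again, which is considerably cheaper and faster. A cache read usually costs a fraction of the base input-token price; a cache write (the first call) costs more than base input tokens, so caching pays off only when the prefix is reused.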

Where to put cache_control

Attach cache_control: {"type": "ephemeral"} to a content block (system text, the tools list, message content, or a document). A cache breakpoint means "everything up to here is cacheable". Anthropic supports up to 4 cache breakpoints per request, which is enough for system + tools + a stable slice of history.
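
The placement rule above can be sketched as a small helper that puts the breakpoint on the last block of a list and checks the per-request limit before sending. The helper names (mark_breakpoint, count_breakpoints) and the get_time tool are hypothetical, made up for illustration; they are not part of the Anthropic SDK.

```python
from copy import deepcopy

MAX_BREAKPOINTS = 4  # per-request limit described in the text above


def mark_breakpoint(blocks):
    """Return a copy of `blocks` with cache_control set on the last block.

    Hypothetical helper: the last block is where "everything up to here
    is cacheable" should end.
    """
    if not blocks:
        raise ValueError("need at least one block to mark")
    marked = deepcopy(blocks)  # leave the caller's list untouched
    marked[-1]["cache_control"] = {"type": "ephemeral"}
    return marked


def count_breakpoints(*block_lists):
    """Count blocks that carry a cache_control marker."""
    return sum(
        1 for blocks in block_lists for block in blocks if "cache_control" in block
    )


raw_system = [{"type": "text", "text": "You are a helpful assistant."}]
raw_tools = [
    {
        "name": "get_time",  # illustrative tool, not from the original
        "description": "Return the current time.",
        "input_schema": {"type": "object", "properties": {}},
    }
]

system = mark_breakpoint(raw_system)
tools = mark_breakpoint(raw_tools)

used = count_breakpoints(system, tools)
if used > MAX_BREAKPOINTS:
    raise ValueError(f"{used} breakpoints exceeds the limit of {MAX_BREAKPOINTS}")
```

The resulting system and tools lists have the same shape as the ones passed to client.messages.create in the Code section.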

What invalidates the cache

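Any change to the cached prefix invalidates the cache. That includes a whitespace change in the system prompt, a changed tool definition, or a shuffled message order. A stable prefix is the engineering goal; if your prompt assembly reorders or rewrites anything, you will see almost no cache hits.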

Principle: Caching rewards a stable prefix. Make the front of the prompt deliberately boring.
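
One way to notice accidental prefix drift before it costs you cache hits is to fingerprint the prefix locally and compare it between builds or requests. This is a minimal sketch: prefix_fingerprint is a hypothetical helper, and the hash is only a local proxy for "did my prefix change", not a reproduction of how the cache is keyed on the server.

```python
import hashlib
import json


def prefix_fingerprint(system_text, tools):
    """Hash the parts of the request that are meant to stay stable.

    Keys are serialized in insertion order on purpose, so a reordered
    tool definition also shows up as a changed fingerprint.
    """
    canonical = json.dumps(
        {"system": system_text, "tools": tools},
        separators=(",", ":"),
        ensure_ascii=False,
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


tools = [
    {
        "name": "get_time",  # illustrative tool, not from the original
        "description": "Return the current time.",
        "input_schema": {"type": "object", "properties": {}},
    }
]

base = prefix_fingerprint("You are Pippa.", tools)
same = prefix_fingerprint("You are Pippa.", tools)
drift = prefix_fingerprint("You are Pippa. ", tools)  # one trailing space

print(base == same, base == drift)  # prints: True False
```

Logging the fingerprint next to the usage counters makes it easier to tie an unexpected cache write to the prompt change that caused it.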

Code

Cache the system prompt and tools list·python

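import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

PIPPA_PERSONA = open("system_prompt.md").read()  # tens of KB of stable identity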

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": PIPPA_PERSONA,
            "cache_control": {"type": "ephemeral"},  # breakpoint #1
        }
    ],
    tools=[
        # ... long tool definitions ...
        {"name": "last_tool", "description": "...", "input_schema": {...},
         "cache_control": {"type": "ephemeral"}},  # breakpoint #2
    ],
    messages=[{"role": "user", "content": "hi"}],
)

u = response.usage
print("cache_creation_input:", u.cache_creation_input_tokens)
print("cache_read_input:", u.cache_read_input_tokens)
print("input:", u.input_tokens, "output:", u.output_tokens)

Verify a cache hit·python

# Make the same call twice within the cache TTL (the ephemeral default is 5 minutes).
for i in (1, 2):
    r = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=64,
        system=[{"type": "text", "text": LARGE_STABLE_TEXT, "cache_control": {"type": "ephemeral"}}],
        messages=[{"role": "user", "content": f"call {i}"}],
    )
    u = r.usage
    print(f"call {i}: write={u.cache_creation_input_tokens}, read={u.cache_read_input_tokens}")
# Expected: call 1 has write > 0, read = 0. Call 2 has write = 0, read > 0.
# (This assumes LARGE_STABLE_TEXT meets the model's minimum cacheable prompt length.)
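
To track hits over more than two calls, the same usage counters can be summed into a single ratio. This is a rough sketch: Usage is a hypothetical stand-in for response.usage so that the example runs offline, and it assumes that input_tokens counts only the uncached part of the prompt, so total input is the sum of the three fields.

```python
from dataclasses import dataclass


@dataclass
class Usage:
    """Hypothetical stand-in for response.usage (same field names)."""

    input_tokens: int
    cache_creation_input_tokens: int
    cache_read_input_tokens: int


def cache_read_share(usages):
    """Fraction of all input tokens that were served from the cache."""
    read = sum(u.cache_read_input_tokens or 0 for u in usages)
    total = sum(
        (u.input_tokens or 0)
        + (u.cache_creation_input_tokens or 0)
        + (u.cache_read_input_tokens or 0)
        for u in usages
    )
    return read / total if total else 0.0


# Illustrative numbers only: one write followed by one read of a 5000-token prefix.
history = [
    Usage(input_tokens=12, cache_creation_input_tokens=5000, cache_read_input_tokens=0),
    Usage(input_tokens=12, cache_creation_input_tokens=0, cache_read_input_tokens=5000),
]
share = cache_read_share(history)
print(f"cache read share: {share:.1%}")
```

A share that stays near zero across many calls suggests the prefix is changing between requests.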

Exercise

Pick a prompt that resends a stable preamble across 10 calls. Add a cache_control breakpoint at the preamble boundary. Compare the total cost without caching vs. with caching (one write, plus read tokens at the cache rate).

Hint

A cache write costs more than regular input tokens (about 25% more), but reads are dramatically cheaper. Break-even usually arrives within 2-3 calls.
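
The arithmetic behind the hint can be sketched in units of "one base input token". The write multiplier comes from the roughly 25% figure above; the read multiplier of 0.10 is an assumption made for illustration, so check current pricing before relying on it. The model also assumes every call after the first lands inside the TTL and hits the cache, and it ignores the uncached tail of the prompt and all output tokens.

```python
WRITE_MULTIPLIER = 1.25  # from the hint: a write costs about 25% more than base input
READ_MULTIPLIER = 0.10   # ASSUMPTION for illustration; not stated in the text


def cost_without_cache(prefix_tokens, calls):
    """Prefix cost when every call pays the full base input price."""
    return float(prefix_tokens * calls)


def cost_with_cache(prefix_tokens, calls):
    """Prefix cost with one cache write followed by cache reads."""
    if calls <= 0:
        return 0.0
    return prefix_tokens * (WRITE_MULTIPLIER + READ_MULTIPLIER * (calls - 1))


def break_even_calls(max_calls=100):
    """Smallest call count at which caching is strictly cheaper, or None."""
    for n in range(1, max_calls + 1):
        if cost_with_cache(1, n) < cost_without_cache(1, n):
            return n
    return None


plain = cost_without_cache(5000, 10)
cached = cost_with_cache(5000, 10)
print(f"10 calls, 5000-token preamble: {plain:.0f} vs {cached:.0f} base-token units")
print("break-even at call", break_even_calls())
```

Multiply the results by your model's per-token input price to get dollars; a different read multiplier changes the totals but not the shape of the comparison.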
