Cost, Latency, and Cache Asymmetry

Just because it fits doesn't mean long context is free

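A bigger context means more prefill work: the model has to process the entire prefix before it generates the first token. Even when a provider supports a huge window, latency can become unacceptable for interactive collaboration. The attention part of prefill compute scales roughly quadratically with token count, so a 1M-token request isn't just 5x slower than a 200K one; it can be far worse.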
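To put a rough number on "far worse", here is a minimal sketch that compares prefill cost under a simple power-law assumption. The function name `relative_prefill_cost` and the `exponent` parameter are hypothetical, made up for illustration. Real systems usually land somewhere between the linear and quadratic bounds, depending on model size, kernels, and hardware.

```python
def relative_prefill_cost(tokens: int, baseline: int, exponent: float = 2.0) -> float:
    """Prefill cost of `tokens` relative to `baseline`, assuming cost ~ tokens**exponent.

    exponent=1.0 models purely linear work; exponent=2.0 models the
    attention-dominated, quadratic upper bound. Both are assumptions.
    """
    return (tokens / baseline) ** exponent


linear = relative_prefill_cost(1_000_000, 200_000, exponent=1.0)
quadratic = relative_prefill_cost(1_000_000, 200_000, exponent=2.0)
print(f"linear bound: {linear:.0f}x, quadratic bound: {quadratic:.0f}x")
# prints "linear bound: 5x, quadratic bound: 25x"
```

Treat the 25x as an upper bound for intuition only, not as a latency prediction.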

Three cost asymmetries

1. Input vs output: output usually costs 3-5x as much as input.
2. Cached vs fresh input: cached input is 10-90% cheaper than fresh input (Anthropic cached reads are ~90% off; OpenAI and Gemini offer comparable discount tiers).
3. Standard vs extended context: once you cross the long-context threshold (usually 200K), the per-token rate jumps.

Worked example

Take a 50K-token codebase review that produces 8K tokens of feedback, at illustrative Sonnet pricing ($3/M input, $15/M output, $0.30/M cached). Single-shot costs $0.27 ($0.15 input plus $0.12 output). Ten follow-ups with no cache run ~$4.70; ten follow-ups with prompt caching on run ~$1.60. Same conversation, about 3x cheaper, just from structuring the prefix to be reusable.
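The single-request arithmetic above can be sketched like this. `PRICE_PER_M` and `request_cost` are hypothetical names, the prices are the illustrative ones from the text, and the sketch ignores any cache-write surcharge a provider might add. The multi-turn totals depend on how conversation history grows from turn to turn, which isn't spelled out here, so the sketch only covers one request.

```python
# Illustrative USD prices per million tokens (taken from the text, not a live price list).
PRICE_PER_M = {"input": 3.00, "output": 15.00, "cached": 0.30}


def request_cost(fresh_in: int, out: int, cached_in: int = 0) -> float:
    """USD cost of one request, given fresh-input, output, and cached-input token counts."""
    return (
        fresh_in * PRICE_PER_M["input"]
        + cached_in * PRICE_PER_M["cached"]
        + out * PRICE_PER_M["output"]
    ) / 1_000_000


# First request: the whole 50K prefix is fresh input.
single_shot = request_cost(fresh_in=50_000, out=8_000)
# A later request where the same 50K prefix is served from cache.
warm = request_cost(fresh_in=0, out=8_000, cached_in=50_000)

print(f"single shot: ${single_shot:.3f}, cached prefix: ${warm:.3f}")
# prints "single shot: $0.270, cached prefix: $0.135"
```

Once the prefix is cached, output tokens dominate the bill, which is why the cached scenario flattens out around the output cost.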

Interactive vs batch

A huge context is perfect for one-off analysis and terrible for a tight edit loop. When a human is collaborating live, responsiveness is part of quality. Batch jobs forgive latency; interactive sessions don't.

Code

cost shape sketch·python

# Illustrative USD costs per scenario, using the worked-example pricing
# ($3/M input, $15/M output, $0.30/M cached). "total" = input + output.
scenarios = {
  "single_shot_50k":          {"input": 0.150, "output": 0.120, "total": 0.27},
  "ten_turns_no_cache":       {"input": 3.50,  "output": 1.20,  "total": 4.70},
  "ten_turns_with_caching":   {"input": 0.40,  "output": 1.20,  "total": 1.60},
}
# Caching is the single biggest cost lever after prompt design.

identifying stable prefix vs variable tail·yaml

stable_prefix:
  - system rules
  - tool schemas
  - reusable examples
  - corpus snippets you re-read each turn
variable_tail:
  - this turn's user question
  - latest diff
  - latest tool result
  - timestamp (if truly needed)
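The split above can be sketched in code: put the stable parts first, in a fixed order, and append the per-turn parts last, so the leading bytes are identical from turn to turn. `build_prompt` is a hypothetical helper, and the SHA-256 fingerprint is only a local way to check that the prefix didn't drift; it isn't any provider's cache key or API.

```python
import hashlib


def build_prompt(stable_prefix: list[str], variable_tail: list[str]) -> tuple[str, str]:
    """Return (full prompt, fingerprint of the stable prefix).

    Stable parts always come first and in the same order; anything that
    changes per turn goes at the end so it can't invalidate the prefix.
    """
    prefix = "\n\n".join(stable_prefix)
    prompt = prefix + "\n\n" + "\n\n".join(variable_tail)
    fingerprint = hashlib.sha256(prefix.encode("utf-8")).hexdigest()
    return prompt, fingerprint


# Placeholder contents, for illustration only.
stable = ["<system rules>", "<tool schemas>", "<reusable examples>"]

turn1, key1 = build_prompt(stable, ["Q: why does this test fail?", "<latest diff v1>"])
turn2, key2 = build_prompt(stable, ["Q: and after this change?", "<latest diff v2>"])

# Same fingerprint means a byte-identical prefix across turns.
print(key1 == key2)  # prints "True"
```

If the fingerprint changes between turns, something variable (a timestamp, a reordered tool list) has leaked into the prefix.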
