The 1M Token Window Illusion

The number on the box is a feature flag, not an everyday tool

By 2026, models are advertised with 200K, 400K, 1M, even 2M token windows, and marketing material waves that number like a flag. Reality is messier: a window your input fits into is not a window you have to fill. Three cliffs are waiting.

Cliff 1: Cost

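Most major providers change the per-token price once a request crosses a long-context threshold (usually around 200K input tokens). Anthropic Sonnet in 1M-beta mode doubles the input price above 200K, and the output price jumps meaningfully too. Gemini 2.5 Pro also doubles the input price above 200K. OpenAI's long-context tier carries a surcharge as well. Long context is a paid capability, not a free upgrade.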
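To make the cost cliff concrete, here is a minimal sketch of two-tier input pricing. The base price, threshold, and multiplier are illustrative placeholders, not any provider's actual rates, and the sketch assumes that crossing the threshold reprices the whole request rather than only the tokens above it. Check the current pricing page before relying on either assumption.

```python
def input_cost_usd(input_tokens, base_price_per_mtok=3.00,
                   threshold=200_000, multiplier=2.0):
    """Input cost in USD under a simple two-tier scheme (illustrative numbers)."""
    # Assumption: once the request exceeds the threshold, every input
    # token in it is billed at the higher rate.
    rate = base_price_per_mtok
    if input_tokens > threshold:
        rate *= multiplier
    return input_tokens * rate / 1_000_000


below = input_cost_usd(200_000)   # last request size still on the base tier
above = input_cost_usd(200_001)   # one more token tips it over
print(f"200,000 tokens: ${below:.4f}")
print(f"200,001 tokens: ${above:.4f}")
```

Under these assumptions the bill is a step function, not a slope: a single extra token roughly doubles the input cost of the request.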

Cliff 2: Latency

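Time-to-first-token (TTFT) on a full context is brutal. For a ~1M-token prefill, 30 seconds to 2 minutes before the first character appears is normal. That is fine for batch analysis and unusable for interactive code review. "Fast" and "long context" are usually different choices.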
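The 30-second-to-2-minute range can be turned into a rough prefill rate and reused for smaller prompts. This is a back-of-the-envelope sketch: it assumes TTFT is dominated by prefill and scales linearly with prompt length, which real systems only approximate (attention cost, queueing, and prompt caching all bend the curve). The function names are made up for illustration.

```python
def implied_prefill_rate(prompt_tokens, ttft_seconds):
    """Tokens per second implied by an observed TTFT."""
    return prompt_tokens / ttft_seconds


def estimate_ttft_seconds(prompt_tokens, prefill_tokens_per_second):
    """Linear TTFT estimate; a simplifying assumption, not a measurement."""
    return prompt_tokens / prefill_tokens_per_second


# Rates implied by the 30 s and 2 min figures for a 1M-token prompt.
fast = implied_prefill_rate(1_000_000, 30)
slow = implied_prefill_rate(1_000_000, 120)

for label, rate in (("fast", fast), ("slow", slow)):
    small = estimate_ttft_seconds(32_000, rate)
    print(f"{label}: ~{rate:,.0f} tok/s, 32K prompt in ~{small:.2f} s")
```

Even at the slow end, a 32K prompt stays within a few seconds under this model, which is the gap between interactive and batch.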

Cliff 3: Attention

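Even if you can afford the cost and sit through the wait, the model does not attend to the whole input with uniform fidelity. Benchmarks such as RULER and NoCha show effective reasoning breaking down well before the advertised limit. "It can read it" and "it can reason across it" are different claims.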
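One cheap way to check this on your own data is a depth probe: plant a known fact at different positions in a long input and see where the model stops finding it. The sketch below assumes a hypothetical `ask(prompt) -> str` callable that wraps whatever model client you use; nothing here is a real provider API. Simple recall like this is a much easier task than the multi-step reasoning RULER and NoCha measure, so passing it is a lower bound, not proof of usable long-context reasoning.

```python
def build_probe(filler_sentences, needle, depth):
    """Insert `needle` at a relative depth (0.0 = start, 1.0 = end)."""
    idx = round(depth * len(filler_sentences))
    parts = filler_sentences[:idx] + [needle] + filler_sentences[idx:]
    return " ".join(parts)


def recall_by_depth(ask, filler_sentences, needle, question, expected,
                    depths=(0.0, 0.25, 0.5, 0.75, 1.0)):
    """Map each depth to True if the expected answer appears in the reply.

    `ask` is a hypothetical callable: it takes a prompt string and
    returns the model's reply as a string.
    """
    results = {}
    for depth in depths:
        prompt = build_probe(filler_sentences, needle, depth) + "\n\n" + question
        reply = ask(prompt)
        results[depth] = expected.lower() in reply.lower()
    return results


# Stand-in "model" that only sees the last 300 characters of the prompt.
def fake_ask(prompt):
    return prompt[-300:]


filler = [f"Filler sentence number {i}." for i in range(100)]
print(recall_by_depth(fake_ask, filler, "The vault code is 4417.",
                      "What is the vault code?", "4417"))
```

Swapping `fake_ask` for a real client and growing `filler` toward the advertised window gives a rough picture of where recall starts to drop.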

A 1M context is a nuclear power plant, not a daily driver. It is powerful when you truly need it, but using it for a job a 32K window would finish better is a miserable waste.

Code

Long-context cliffs at a glance·yaml

long_context_cliffs:
  cost:
    threshold: "~200K input on most providers"
    effect: "input price typically 2x; output also rises"
  latency:
    one_million_token_prefill: "30s to 2min TTFT is normal"
    effect: "interactive feel disappears"
  attention:
    benchmark: ["RULER", "NoCha", "LongBench v2"]
    effect: "effective reasoning often 30-50% of advertised window"

When long context is the right tool·python

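def use_long_context(task):
    """Return True only when every condition for long context holds."""
    return (
        task.is_batch_or_offline                    # the TTFT wait is acceptable
        and task.requires_holistic_reasoning        # answer depends on the whole input
        and not task.tolerates_retrieval_pipeline   # retrieval is not a workable substitute
        and task.willing_to_pay_premium_tier        # budget covers the long-context surcharge
    )
# Otherwise: RAG + small context + tool calls.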
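A minimal usage sketch for the predicate above. The `Task` dataclass, the `choose_strategy` helper, and the two example tasks are hypothetical; the predicate is repeated only so the block runs on its own.

```python
from dataclasses import dataclass


@dataclass
class Task:
    # Hypothetical task description with the four flags the predicate reads.
    is_batch_or_offline: bool
    requires_holistic_reasoning: bool
    tolerates_retrieval_pipeline: bool
    willing_to_pay_premium_tier: bool


def use_long_context(task):
    return (
        task.is_batch_or_offline
        and task.requires_holistic_reasoning
        and not task.tolerates_retrieval_pipeline
        and task.willing_to_pay_premium_tier
    )


def choose_strategy(task):
    """Pick long context only when all four conditions hold."""
    return "long_context" if use_long_context(task) else "rag_small_context_tools"


# Offline audit that must reason over an entire codebase at once.
audit = Task(is_batch_or_offline=True, requires_holistic_reasoning=True,
             tolerates_retrieval_pipeline=False, willing_to_pay_premium_tier=True)

# Interactive code review: latency alone rules out long context.
review = Task(is_batch_or_offline=False, requires_holistic_reasoning=True,
              tolerates_retrieval_pipeline=True, willing_to_pay_premium_tier=True)

print(choose_strategy(audit))
print(choose_strategy(review))
```

Because the conditions are joined with `and`, failing any single one sends the task back to the default path of retrieval plus a small context.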
