Why This Bottleneck Matters

The wall isn't theoretical: it blocks real applications

It's easy to read the complexity argument as academic. It isn't. Every workload that wants to consume more than ~32K tokens (a full legal brief, an entire codebase, hours of audio at frame rate, biological sequences, multi-document RAG that concatenates every source, long agentic traces) is already paying real money for the quadratic tax, or is being refused outright.

On consumer hardware the wall is closer than people think. The 2025 study from the previous lesson, in which a Transformer stalled at ~25K tokens on a 24GB consumer card while an SSM reached 220K+, is not an edge case. It's the daily experience of anyone trying to run a coding assistant on their own repo on their own machine.

Deployment economics is the real motivator

Serving a 128K-context Transformer model costs roughly 3–5× more per request than serving the same model at 8K context. That ratio is a structural tax paid by every long-context API consumer. It's why Anthropic's 1M-context Claude has a separate pricing tier, why some providers cap context on their cheap tiers at 32K-64K, and why an open-weights long-context model that runs cheaply on commodity GPUs would be a wedge product.

What "alternative" actually means

In this landscape, an "alternative" architecture sells at least one of three things. Linear or near-linear training compute, so that longer context doesn't inflate training cost quadratically. Constant-memory inference, so that per-token cost stays flat regardless of context length. Better hardware mapping, so that the same FLOPs translate into more wall-clock throughput. The rest of this quest is basically a tour of who sells what, and what they give up to sell it.
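To make the first of those selling points concrete, here is a rough sketch of how the token-mixing cost grows with context. It assumes the usual back-of-envelope count for attention (about 2·n²·d FLOPs for QKᵀ plus about 2·n²·d for applying the weights to V, per layer) and compares it against a hypothetical linear-time mixer that does a fixed amount of work per token. The function names and the `state_dim=16` value are illustrative assumptions, not figures from any specific model.

```python
def attention_score_flops(n_tokens: int, d_model: int) -> int:
    """Approximate per-layer FLOPs for the quadratic part of attention.

    QK^T costs about 2*n^2*d and (weights @ V) about another 2*n^2*d.
    Projections and MLPs are ignored because they are linear in n.
    """
    return 4 * n_tokens**2 * d_model


def linear_mixer_flops(n_tokens: int, d_model: int, state_dim: int = 16) -> int:
    """Hypothetical linear-time token mixer: fixed work per token.

    state_dim is an assumed per-channel state size, chosen for illustration.
    """
    return 2 * n_tokens * d_model * state_dim


if __name__ == "__main__":
    d_model = 8192
    base = 8_192
    for n in (8_192, 32_768, 131_072):
        quad = attention_score_flops(n, d_model) / attention_score_flops(base, d_model)
        lin = linear_mixer_flops(n, d_model) / linear_mixer_flops(base, d_model)
        print(f"{n:>7} tokens: attention x{quad:.0f}, linear mixer x{lin:.0f} vs 8K")
```

Only the ratios matter here: the absolute constants depend on the architecture, but multiplying the context by k multiplies the quadratic term by k² and the linear term by k.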

Exercise

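Pick one workload you actually care about that runs past 8K tokens (your repo, an agent trace log, an audio transcript, anything). Tokenize it with tiktoken or the Llama 3 tokenizer, count the tokens, and compute (1) the fp16 attention-matrix memory at the full sequence length and (2) the KV-cache memory for Llama 3.1 70B with GQA at that length. Price that memory against the GPU you would run it on, and you have a concrete dollar amount to attach to "why does long context matter to me".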
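If you want a starting point, the two calculations can be sketched as below. This is a minimal sketch under stated assumptions: the attention matrix is counted as a naively materialized n × n fp16 matrix per head, and the Llama 3.1 70B shape is taken as 80 layers, 8 KV heads, and head dimension 128 (check these against the model config you actually use). The `count_tokens` helper uses tiktoken's `cl100k_base` encoding as a stand-in tokenizer, so its counts will differ somewhat from the Llama 3 tokenizer.

```python
GIB = 1024**3


def attention_matrix_bytes(n_tokens: int, n_heads: int = 1, bytes_per_elem: int = 2) -> int:
    """Memory for a fully materialized n x n attention matrix (fp16 by default).

    This is per layer; pass n_heads to account for every head in that layer.
    """
    return n_tokens * n_tokens * n_heads * bytes_per_elem


def kv_cache_bytes(
    n_tokens: int,
    n_layers: int = 80,      # assumed Llama 3.1 70B depth
    n_kv_heads: int = 8,     # assumed GQA key/value head count
    head_dim: int = 128,     # assumed per-head dimension
    bytes_per_elem: int = 2, # fp16
) -> int:
    """KV-cache memory for one sequence: keys and values for every layer."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * n_tokens


def count_tokens(text: str) -> int:
    """Count tokens with tiktoken (requires `pip install tiktoken`)."""
    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")
    # disallowed_special=() treats special-token strings in the text as plain text.
    return len(enc.encode(text, disallowed_special=()))


if __name__ == "__main__":
    n = 32_768  # replace with count_tokens(open("your_file").read())
    print(f"tokens:               {n}")
    print(f"attention matrix/head: {attention_matrix_bytes(n) / GIB:.2f} GiB")
    print(f"KV cache:              {kv_cache_bytes(n) / GIB:.2f} GiB")
```

Under these assumptions the KV cache works out to 320 KiB per token, so the total grows linearly with context, while the materialized attention matrix grows with the square of it.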
