Perfect Recall and In-Context Learning

The capability gap that won't go away

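A Transformer keeps an explicit KV cache in which every past token is directly accessible from every later position. SSMs compress the history into a fixed-size state. Hyena's filters are content-independent. RWKV's state decays. RetNet's decay is fixed per head. Only the Transformer still has each original token's representation available at lookup time, and that property is what makes in-context learning genuinely robust.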
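The contrast between an explicit cache and a fixed-size state can be illustrated with a toy associative-recall experiment. The sketch below is not Mamba, RWKV, or any real architecture: it assumes hard (argmax) attention for the cache and a linear-attention-style outer-product matrix for the compressed state, and every name in it (`D`, `cache_lookup`, `build_state`, `recall_accuracy`) is made up for illustration. It only shows the general effect: a store that grows with the sequence recalls exactly, while a fixed `D x D` state starts to blur once many pairs share it.

```python
import random

D = 16  # key/value width; the compressed state is always D x D numbers


def to_pm1(n, width):
    # Encode integer n as a vector of +1/-1 bits.
    return [1 if (n >> b) & 1 else -1 for b in range(width)]


def make_pairs(n_items, rng):
    # Keys are distinct by construction; values are random +1/-1 vectors.
    keys = [to_pm1(k, D) for k in rng.sample(range(2 ** D), n_items)]
    values = [[rng.choice((-1, 1)) for _ in range(D)] for _ in range(n_items)]
    return keys, values


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cache_lookup(keys, values, query):
    # Cache-style recall: every (key, value) pair is still stored, so the
    # best-matching key returns its original value untouched.
    best = max(range(len(keys)), key=lambda i: dot(query, keys[i]))
    return values[best]


def build_state(keys, values):
    # Compressed recall: all pairs are summed into one fixed-size matrix.
    state = [[0] * D for _ in range(D)]
    for k, v in zip(keys, values):
        for i in range(D):
            for j in range(D):
                state[i][j] += k[i] * v[j]
    return state


def state_lookup(state, query):
    # Read out by projecting the query through the state, then taking signs.
    out = [sum(query[i] * state[i][j] for i in range(D)) for j in range(D)]
    return [1 if x > 0 else -1 for x in out]


def recall_accuracy(n_items, seed=0):
    # Fraction of stored values recovered exactly by each mechanism.
    rng = random.Random(seed)
    keys, values = make_pairs(n_items, rng)
    state = build_state(keys, values)
    cache_hits = sum(cache_lookup(keys, values, k) == v for k, v in zip(keys, values))
    state_hits = sum(state_lookup(state, k) == v for k, v in zip(keys, values))
    return cache_hits / n_items, state_hits / n_items
```

Calling `recall_accuracy(n)` for increasing `n` gives a rough feel for the trade-off: the cache's memory grows with `n` and its recall stays exact, while the state's memory stays constant and its recall degrades as stored pairs interfere with each other.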

The 2025 NeurIPS spotlight "Achilles' Heel of Mamba" made this rigorous: pure SSMs fail systematically on copy and recall tasks where Transformers succeed, and the failure cannot be patched with more data. It is structural. Five-shot MMLU is the canonical practical demonstration: pure SSMs lag substantially because in-context learning requires looking back at the few-shot examples and pattern-matching against them, and the compressed state blurs that lookup.

Why Transformers won't be replaced wholesale

The recall gap is the single most important fact for understanding why production frontier models in 2026 are still Transformers (or hybrids with a meaningful attention component). Multi-step tool use, agentic workflows that accumulate observations, code generation that must respect distant function signatures, RAG that retrieves 10–100 chunks at a time and reasons over them: all of these depend on exactly the kind of precise per-token recall that Transformers do natively and compressed-state alternatives cannot.

You can build hybrids that recover most of this capability with far less attention. You cannot build a pure-replacement architecture that recovers it. That asymmetry is why every successful production deployment discussed in this quest is a Transformer, a hybrid with attention layers, or a domain-specific model where exact recall doesn't matter.

Frontier reasoning angle

Frontier reasoning workloads (complex math, multi-step proofs, sophisticated tool use, codebase-wide refactors) are especially recall-heavy. The model has to maintain coherent references to many specific facts, intermediate results, and constraint definitions across thousands of tokens. These are the hardest workloads in modern AI, and they are where the Transformer's advantage is largest. Don't expect the frontier to leave the Transformer family until something seriously addresses this gap, and no current alternative does so at production scale.

Exercise

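Build a synthetic in-context learning task that gives the model five examples of a pattern (e.g. "input → reversed input") and asks it to apply the pattern to a new input. Train and run a small Transformer and a small Mamba with similar parameter counts on this task. Vary the example separator and the input length. The Transformer should learn the pattern reliably; Mamba should degrade much faster as the examples move farther from the query token. That degradation is the recall gap, in your own training run.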
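One possible starting point for the data side of this exercise is sketched below. It covers only episode generation and scoring, not model training. The function names (`make_episode`, `exact_match_rate`), the ASCII ` -> ` arrow, the lowercase alphabet, and the `generate` callable are all assumptions made for illustration; substitute whatever tokenizer and model interface you actually use.

```python
import random
import string


def make_episode(rng, n_examples=5, input_len=4, separator="\n",
                 alphabet=string.ascii_lowercase):
    # Build one few-shot prompt of "input -> reversed input" examples,
    # followed by an unanswered query. Returns (prompt, expected_answer).
    def sample():
        return "".join(rng.choice(alphabet) for _ in range(input_len))

    shots = [sample() for _ in range(n_examples)]
    query = sample()
    lines = [f"{s} -> {s[::-1]}" for s in shots]
    prompt = separator.join(lines + [f"{query} -> "])
    return prompt, query[::-1]


def make_dataset(n_episodes, seed=0, **episode_kwargs):
    # Fixed seed so both models are evaluated on identical episodes.
    rng = random.Random(seed)
    return [make_episode(rng, **episode_kwargs) for _ in range(n_episodes)]


def exact_match_rate(generate, episodes):
    # `generate` is a hypothetical callable: prompt string -> completion string.
    hits = sum(generate(prompt).strip() == target for prompt, target in episodes)
    return hits / len(episodes)
```

Sweeping `separator` (for example a single newline versus a long filler string) and `input_len` while holding the seed fixed changes how far the examples sit from the query, which is the variable the exercise asks you to vary.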
