How Retention Differs from Attention

Core constraint: data-independent decay

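When computing the output for token n, the retention decay applied to token m is γ^(n-m), regardless of what the token contains. How quickly a token fades is set entirely by recency: older tokens decay exponentially at a fixed per-head rate, and that is all.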

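Standard attention is the opposite: the weight is content-dependent, determined by the dot product of Q_n and K_m. The model can learn to weight a specific distant token highly because of its content, not because of its position.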
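
To make the contrast concrete, here is a minimal NumPy sketch. The function names, the one-hot keys, and the value γ = 0.5 are illustrative assumptions, not taken from any RetNet code. It computes the decay factors γ^(n-m) for a query at position n = 3 and the softmax attention weights for a query that happens to match the oldest key.

```python
import numpy as np

def retention_decay(n, gamma):
    """Decay factors gamma**(n - m) for m = 0..n. Token content never enters."""
    m = np.arange(n + 1)
    return gamma ** (n - m)

def attention_weights(q_n, K):
    """Softmax attention weights of query q_n over the keys in K."""
    scores = K @ q_n / np.sqrt(q_n.shape[0])
    scores = scores - scores.max()  # subtract the max for numerical stability
    w = np.exp(scores)
    return w / w.sum()

# Four one-hot keys (a toy setup); the query matches the oldest one, m = 0.
K = np.eye(4)
q = 5.0 * K[0]

decay = retention_decay(3, gamma=0.5)  # peaks at m = 3, the most recent token
attn = attention_weights(q, K)         # peaks at m = 0, the matching token
```

The decay vector is the same for every input sequence, while the attention weights move to wherever the matching key sits.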

Expressiveness tradeoff

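Data-independent decay is what gives RetNet its O(1) inference state. The convolution view of the recurrence works because γ is fixed. But it is also what caps expressiveness. RetNet cannot say "this specific earlier token matters, give it extra attention." How much of an earlier token survives is set entirely by recency.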
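
The sketch below illustrates why a fixed γ permits a constant-size state. It is a simplified single-head retention, assuming no positional rotation, normalization, or scaling, and the function names are hypothetical. The parallel form applies a decay mask D with D[n, m] = γ^(n-m) for m ≤ n, and the recurrent form carries only one d_k × d_v state matrix between steps. With γ fixed, the two should produce the same outputs.

```python
import numpy as np

def retention_parallel(Q, K, V, gamma):
    """(Q K^T * D) V with D[n, m] = gamma**(n - m) for m <= n, else 0."""
    T = Q.shape[0]
    idx = np.arange(T)
    diff = idx[:, None] - idx[None, :]
    D = np.where(diff >= 0, gamma ** np.maximum(diff, 0), 0.0)
    return (Q @ K.T * D) @ V

def retention_recurrent(Q, K, V, gamma):
    """Same computation, one token at a time, with a fixed-size state S."""
    S = np.zeros((K.shape[1], V.shape[1]))  # size is independent of sequence length
    out = np.empty((Q.shape[0], V.shape[1]))
    for n in range(Q.shape[0]):
        S = gamma * S + np.outer(K[n], V[n])  # decay the old state, add the new token
        out[n] = Q[n] @ S
    return out
```

Calling both functions on the same random Q, K, V and comparing with `np.allclose` is a quick sanity check of the equivalence.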

For tasks where recency is a good proxy for relevance (most natural-language modeling, especially summarization-style tasks), this is fine. For tasks where the model must selectively focus on a specific earlier moment (the same recall-heavy tasks that defeat pure SSMs), retention falls short.

Mental model

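Think of retention as automatic exponential forgetting: the model cannot choose what to forget; the past simply fades on a fixed schedule. Standard attention is learned selective focus: for each query, the model can choose what to attend to, regardless of distance. The two architectures sit at opposite ends of the recall-vs-cost spectrum.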

Mamba's selectivity is, in hindsight, the obvious next move: keep the O(1) inference state, but give the model the ability to vary the decay per token. That is the missing piece RetNet lacks, and the reason Mamba has gained more production traction.
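
As a toy illustration of per-token decay, the sketch below replaces the fixed γ with a gate computed from each token. This is not Mamba's actual parameterization. The sigmoid gate, the projection vector `w`, and both function names are assumptions made up for this example. The state stays the same size, but a gate near 0 wipes the past and a gate near 1 preserves it.

```python
import numpy as np

def gated_recurrence(Q, K, V, gates):
    """S_n = gates[n] * S_{n-1} + outer(K_n, V_n); o_n = Q_n @ S_n."""
    S = np.zeros((K.shape[1], V.shape[1]))  # still a fixed-size state
    out = np.empty((Q.shape[0], V.shape[1]))
    for n in range(Q.shape[0]):
        S = gates[n] * S + np.outer(K[n], V[n])  # the decay now varies per step
        out[n] = Q[n] @ S
    return out

def token_gates(X, w):
    """Hypothetical gate: a sigmoid of a linear projection of each token."""
    return 1.0 / (1.0 + np.exp(-(X @ w)))
```

Passing a constant array for `gates` recovers the fixed-decay recurrence, so the only change is where the decay value comes from.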

Exercise

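Construct two synthetic tasks: (a) a summarization-style task where the most recent tokens matter most (e.g., predicting the next token in a long context where most of the earlier content is noise), and (b) a recall-style task that requires retrieving a specific older token. Train a small RetNet and a small Transformer on each. RetNet should match the Transformer on (a) and lose decisively on (b). Internalizing this matrix is what tells you when to consider a retention-style model.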
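
One possible way to generate the two datasets is sketched below. The task definitions are assumptions chosen for simplicity, and the function names and the marker-token scheme are hypothetical. In (a) the target depends only on the last token. In (b) a reserved marker token is placed at a random early position and the target is the token right after it.

```python
import numpy as np

def make_recency_batch(rng, batch, seq_len, vocab):
    """Task (a): the target depends only on the last token; the rest is noise."""
    x = rng.integers(0, vocab, size=(batch, seq_len))
    y = (x[:, -1] + 1) % vocab
    return x, y

def make_recall_batch(rng, batch, seq_len, vocab, max_pos):
    """Task (b): a marker sits at a random position below max_pos; the target
    is the token immediately after it. Requires max_pos <= seq_len - 1."""
    marker = vocab  # reserved id, never produced by the noise distribution
    x = rng.integers(0, vocab, size=(batch, seq_len))
    pos = rng.integers(0, max_pos, size=batch)
    rows = np.arange(batch)
    y = x[rows, pos + 1].copy()  # payload token to be recalled at the end
    x[rows, pos] = marker
    return x, y
```

Keeping `max_pos` small relative to `seq_len` pushes the payload far from the end of the sequence, which is where a fixed decay should hurt most.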