RWKV: Transformer-quality output with RNN-like inference

The bet

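RWKV ("Receptance Weighted Key Value") asks one question: can RNN-style inference, with O(1) cost per token, deliver Transformer-quality output? As of RWKV-7 ("Goose", March 2025), the answer is "surprisingly close". RWKV makes a different bet from Mamba: it leans harder in the recurrent direction.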

Core idea

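RWKV uses time-mixing and channel-mixing operations that reformulate attention into a form that can be computed recurrently at inference time. During training, the computation can be parallelized across the sequence, much like attention; during inference, the same model can be evaluated step by step like an RNN.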
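The equivalence between the two evaluation modes can be illustrated with a toy single-channel recurrence. This is a minimal sketch under stated assumptions, not the actual RWKV time-mixing formula: the scalar decay `w` and the names `r`, `k`, `v`, `parallel_form`, and `recurrent_form` are all hypothetical and chosen only for illustration.

```python
import math

def parallel_form(r, k, v, w):
    """Training-style view: each output is an explicit decayed sum over the past.

    o_t = r_t * sum_{i <= t} w**(t - i) * k_i * v_i
    Every o_t depends only on the inputs, so all positions could be computed
    independently of one another.
    """
    outputs = []
    for t in range(len(v)):
        total = sum(w ** (t - i) * k[i] * v[i] for i in range(t + 1))
        outputs.append(r[t] * total)
    return outputs

def recurrent_form(r, k, v, w):
    """Inference-style view: carry one fixed-size state from step to step."""
    state = 0.0  # the whole "memory" of the past (a single float in this toy)
    outputs = []
    for r_t, k_t, v_t in zip(r, k, v):
        state = w * state + k_t * v_t  # decay the old state, add the new input
        outputs.append(r_t * state)
    return outputs

# Toy inputs (assumed values, illustration only).
r = [1.0, 0.5, 2.0]
k = [0.3, -0.2, 0.7]
v = [1.0, 2.0, -1.0]
w = 0.9

same = all(
    math.isclose(a, b, abs_tol=1e-12)
    for a, b in zip(parallel_form(r, k, v, w), recurrent_form(r, k, v, w))
)
print("parallel and recurrent forms agree:", same)
```

The recurrent form touches only `state` and the current token, which is why per-token cost and memory do not depend on how long the sequence already is.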

Why it matters

Constant memory at inference. RWKV processes a sequence with O(1) memory regardless of its length, because the recurrent state has a fixed size. Ideal for streaming applications.
O(1) inference cost per token. There is no attention over past tokens and no growing KV cache. A natural fit for resource-constrained devices.
Edge and streaming applications. Live transcription, on-device assistants, and low-power scenarios where Transformer attention is impractical.
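The memory argument above can be made concrete with back-of-the-envelope arithmetic. The sketch below assumes a simplified model: a Transformer KV cache that stores one key and one value vector per layer, head, and token, and a recurrent state of one `head_dim x head_dim` matrix per layer and head. The configuration values and function names are hypothetical, and real models keep additional small buffers that are ignored here.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem):
    """Transformer KV cache: keys and values for every past token (grows with seq_len)."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

def recurrent_state_bytes(n_layers, n_heads, head_dim, bytes_per_elem):
    """Assumed recurrent state: one head_dim x head_dim matrix per layer and head."""
    return n_layers * n_heads * head_dim * head_dim * bytes_per_elem

# Hypothetical configuration, not taken from any specific model.
LAYERS, HEADS, HEAD_DIM, FP16 = 24, 32, 64, 2

state = recurrent_state_bytes(LAYERS, HEADS, HEAD_DIM, FP16)
for seq_len in (1_000, 10_000, 100_000):
    kv = kv_cache_bytes(LAYERS, HEADS, HEAD_DIM, seq_len, FP16)
    print(f"{seq_len:>7} tokens: KV cache {kv / 2**20:10.1f} MiB, "
          f"recurrent state {state / 2**20:6.1f} MiB")
```

Under these assumptions the KV cache grows linearly with context length while the recurrent state stays fixed. The fixed state is not free, though: for very short contexts it can be larger than the cache it replaces.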

RWKV-7 "Goose" advances

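Full-matrix dynamic state decay. Earlier RWKV versions used a vector decay, scaling each state channel independently; v7 uses a matrix-valued decay, which makes the state-update step more expressive. It keeps the recurrent inference property while being theoretically more powerful than earlier RWKV variants.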
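The difference in expressiveness can be seen in a toy contrast between the two kinds of state update. This is a sketch only: the transition matrix `A` below is an arbitrary hand-picked matrix, not RWKV-7's actual parameterization, and the function names are hypothetical.

```python
def step_vector_decay(state, w, k, v):
    """S_new[i][j] = w[i] * S[i][j] + k[i] * v[j]: each row decays on its own."""
    n, m = len(state), len(state[0])
    return [[w[i] * state[i][j] + k[i] * v[j] for j in range(m)] for i in range(n)]

def step_matrix_transition(state, A, k, v):
    """S_new = A @ S + outer(k, v): rows of the state can be mixed by A."""
    n, m = len(state), len(state[0])
    return [[sum(A[i][p] * state[p][j] for p in range(n)) + k[i] * v[j]
             for j in range(m)] for i in range(n)]

state = [[1.0, 0.0],
         [0.0, 0.0]]
zero = [0.0, 0.0]  # no new input, to isolate the effect of the transition

# Vector decay can only shrink what is already in each row.
decayed = step_vector_decay(state, [0.5, 0.5], zero, zero)

# A full matrix can also move information between rows (here: swap them).
swap = [[0.0, 1.0],
        [1.0, 0.0]]
mixed = step_matrix_transition(state, swap, zero, zero)

print("vector decay:     ", decayed)
print("matrix transition:", mixed)
```

A vector decay is the special case where the transition matrix is diagonal, so the matrix form can express everything the vector form can, plus updates that route information between state dimensions.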

RWKV today

RWKV is genuinely competitive at small-to-medium scale (1B–14B parameters). Frontier-scale RWKV (70B+) is still in early development as of 2026. The community is small but active, and the architecture has its own ecosystem of inference engines.

RWKV vs Mamba: different bets

Both reject quadratic attention. Mamba leans on selective state spaces with structured dynamics; RWKV leans on reformulating attention for explicit recurrent inference. The two are not converging: they explore different parts of the post-Transformer design space.

Exercise

Pick a streaming or edge-deployment use case (live transcription, on-device assistant, real-time monitoring). Read the constant-memory inference claim in the RWKV-7 paper. Then estimate whether your use case actually saturates the cost of Transformer attention enough to justify a different architecture, or whether a small, quantized dense Transformer would in practice do just as well.
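One possible starting point for the estimate is to ask how long a stream can run before a Transformer's KV cache exhausts a device memory budget. The helper below is a rough sketch: the function names, the memory budget, the model shape, and the token rate are all hypothetical placeholders to be replaced with numbers from your own use case, and it ignores weights, activations, and cache-eviction strategies such as sliding windows.

```python
def max_context_tokens(budget_bytes, n_layers, n_kv_heads, head_dim, bytes_per_elem):
    """Largest context whose KV cache (keys + values) fits in budget_bytes."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return int(budget_bytes // per_token)

def seconds_until_full(max_tokens, tokens_per_second):
    """How long a stream at a steady token rate takes to fill that context."""
    return max_tokens / tokens_per_second

# Placeholder numbers: a 1 GiB cache budget and a small grouped-query model.
BUDGET = 2**30
LAYERS, KV_HEADS, HEAD_DIM = 24, 8, 64

for label, bytes_per_elem in (("fp16 cache", 2), ("8-bit cache", 1)):
    tokens = max_context_tokens(BUDGET, LAYERS, KV_HEADS, HEAD_DIM, bytes_per_elem)
    minutes = seconds_until_full(tokens, tokens_per_second=50) / 60
    print(f"{label}: about {tokens} tokens, roughly {minutes:.1f} minutes at 50 tokens/s")
```

If the resulting window comfortably covers a typical session, the quantized dense Transformer may be sufficient; if the stream must run for hours without truncation, the constant-memory property starts to matter.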