Comment by gwern
3 days ago
K-V caches are large, but hidden states aren't necessarily that large. And if you can run a model once ridiculously fast, then you can loop it repeatedly and still be fast. So I wonder about the 'modern RNNs' like RWKV here...
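A rough back-of-the-envelope sketch of that size gap, in Python. All dimensions below are made-up illustrative numbers, not the real configuration of RWKV or any other model, and the per-layer recurrent state size in particular is an assumption. The point is only that a K-V cache grows linearly with context length while a recurrent state stays fixed:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Size of a transformer K-V cache: one key and one value vector
    per layer, per KV head, per cached token."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem


def recurrent_state_bytes(n_layers, state_elems_per_layer, bytes_per_elem=2):
    """Size of a recurrent model's carried state: a fixed number of
    elements per layer, independent of how many tokens have been seen."""
    return n_layers * state_elems_per_layer * bytes_per_elem


# Hypothetical model shape (assumed for illustration only).
N_LAYERS = 32
N_KV_HEADS = 32
HEAD_DIM = 128
D_MODEL = N_KV_HEADS * HEAD_DIM          # 4096
STATE_ELEMS_PER_LAYER = D_MODEL * 64     # assumed matrix-valued state per layer

if __name__ == "__main__":
    state = recurrent_state_bytes(N_LAYERS, STATE_ELEMS_PER_LAYER)
    for seq_len in (1024, 8192, 65536):
        kv = kv_cache_bytes(N_LAYERS, N_KV_HEADS, HEAD_DIM, seq_len)
        print(f"seq_len={seq_len:>6}: KV cache {kv / 2**20:8.0f} MiB, "
              f"recurrent state {state / 2**20:4.0f} MiB")
```

With these assumed numbers the K-V cache is already 512 MiB at 1024 tokens and keeps growing with context, while the recurrent state stays at 16 MiB regardless of length. Real models using grouped-query attention or cache quantization would narrow the gap, so treat this as an order-of-magnitude illustration only.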