Comment by ebalit
1 year ago
Transformers can also fetch, at any moment, any previous information that becomes useful.
RNNs are constantly updating and overwriting their memory. That means they need to predict what is going to be useful in order to store it for later.
This is a massive advantage for Transformers in interactive use cases like ChatGPT: you give the model context and ask questions over multiple turns, and which part of the context matters for a given question only becomes known later in the token sequence.
To be more precise, I should say it's an advantage of Attention-based models, because there are also hybrid models successfully mixing both approaches, like Jamba.
You could theoretically run the input twice, allowing the model to correlate later tokens with previous ones. That would fix the problem of not knowing what information to retain. A more complicated approach would be to train the RNN to request a replay of some earlier data when needed.
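For illustration, a minimal PyTorch sketch of the "run the input twice" idea, with a toy GRU standing in for the recurrent model (the sizes and names are made up, and this is not how the JRT paper actually implements it):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab_size, hidden = 100, 64
embed = nn.Embedding(vocab_size, hidden)
rnn = nn.GRU(hidden, hidden, batch_first=True)

context = torch.randint(0, vocab_size, (1, 32))   # facts given up front
question = torch.randint(0, vocab_size, (1, 8))   # arrives after the context

# Single pass: the fixed-size state has to guess, before seeing the question,
# which parts of the context are worth keeping.
_, h_once = rnn(embed(torch.cat([context, question], dim=1)))

# "Read twice": replay the context after the question, so the second pass can
# store whatever the question just made relevant.
_, h_twice = rnn(embed(torch.cat([context, question, context], dim=1)))
```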
A great thing about RNNs is that they can easily fork the state and generate trees; that would make it possible to backtrack and work on combinatorial search problems.
It's also easier to cache demonstrations for free in the initial state: a model that has seen lots of data doesn't use any more memory than a model starting from scratch.
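A minimal PyTorch sketch of both points, again with a toy GRU as a stand-in (names and shapes are purely illustrative, not any particular library's API): the demonstration prefix is run once and its fixed-size state is cached, and forking a branch is just copying that state.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab_size, hidden = 100, 64
embed = nn.Embedding(vocab_size, hidden)
rnn = nn.GRU(hidden, hidden, batch_first=True)

# Run the few-shot demonstrations once and keep the resulting state.
# Its size is fixed no matter how long the demonstrations were.
demos = torch.randint(0, vocab_size, (1, 128))
_, h_demo = rnn(embed(demos))

def step(state, token_id):
    """Advance one token from a given state; the original state is untouched."""
    x = embed(torch.tensor([[token_id]]))
    _, new_state = rnn(x, state)
    return new_state

# Fork: explore two candidate continuations from the same cached state.
branch_a = step(h_demo.clone(), token_id=7)
branch_b = step(h_demo.clone(), token_id=42)
# Backtracking is just discarding a branch and resuming from h_demo.
```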
Something like this?
https://hazyresearch.stanford.edu/blog/2024-07-01-jrt
Yes, that's the paper.