Comment by suninsight
2 days ago
Key questions:
1. The key data point seems to be Figure 6a, which compares performance on BABILong and claims Titans reaches ~62%, versus ~42% for GPT-4o-mini, at a 100k sequence length.
However, GPT-4o and Claude are missing from this comparison - maybe because they perform better?
2. No example of the Neural Memory Module in action is provided. That would be the first thing I would ask of this paper; a rough sketch of the update rule, as I understand it, is below.
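For anyone else wondering what such a module might look like in practice, here is a minimal sketch (not the authors' code) based on the update rule the paper describes: the memory is a small MLP whose weights are adapted at test time by gradient descent on an associative loss ||M(k_t) - v_t||^2, with a momentum term (the "surprise") and weight decay (the "forgetting" gate). Class and parameter names (NeuralMemory, lr, eta, alpha) are illustrative, not from the paper.

```python
# Minimal sketch of a test-time-updated neural memory, assuming the
# surprise-based update from the Titans paper. Illustrative only.
import torch
import torch.nn as nn

class NeuralMemory(nn.Module):
    def __init__(self, dim: int, hidden: int = 128):
        super().__init__()
        # The memory itself: a small MLP mapping keys to values.
        self.mlp = nn.Sequential(nn.Linear(dim, hidden), nn.SiLU(), nn.Linear(hidden, dim))
        # Key/value projections of the incoming token.
        self.w_k = nn.Linear(dim, dim, bias=False)
        self.w_v = nn.Linear(dim, dim, bias=False)
        # Momentum buffers carrying "surprise" from past tokens.
        self.momentum = [torch.zeros_like(p) for p in self.mlp.parameters()]

    @torch.no_grad()
    def write(self, x: torch.Tensor, lr: float = 1e-2, eta: float = 0.9, alpha: float = 1e-3):
        """Memorize one token x of shape (dim,) by updating the MLP weights."""
        k, v = self.w_k(x), self.w_v(x)
        with torch.enable_grad():
            loss = (self.mlp(k) - v).pow(2).sum()   # associative recall loss
            grads = torch.autograd.grad(loss, list(self.mlp.parameters()))
        for p, m, g in zip(self.mlp.parameters(), self.momentum, grads):
            m.mul_(eta).add_(g, alpha=-lr)          # past surprise + momentary surprise
            p.mul_(1 - alpha).add_(m)               # weight decay acts as forgetting

    @torch.no_grad()
    def read(self, x: torch.Tensor) -> torch.Tensor:
        """Retrieve from memory by querying with the key of x."""
        return self.mlp(self.w_k(x))
```

At inference time you would call write() on each incoming token and read() to retrieve context, which is what makes this different from a fixed-weight transformer: the memory's parameters keep changing as the sequence streams in.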
The biggest model they have used has only 760M parameters, yet it outperforms models an order of magnitude larger. Goddamn.
This paper was written by a very small team at Google. In that regard it reminds me of the original transformer paper. If this technique scales well, Google is no doubt already exploiting it for their next-generation models, and I think there are signs that the Gemini 2.0 models already do.