Comment by cs702
4 days ago
Interesting. I like the idea of a meta-mechanism that learns to update an associative memory based on how surprising the data is. The other stuff, reading memory via keys and values and selectively erasing it with gating, looks pretty conventional at first glance. Thank you for sharing this on HN. I've added it to my reading list.
EDIT: I'm reminded of this other type of associative memory: https://github.com/glassroom/heinsen_routing. The idea there is to compute a mixture of memories that best predicts the given input sequence. Quite frankly, I don't remember how the whole thing works, but I do remember that it works. It's been a while since I used it, so YMMV. In any case, it may be of interest to you.
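For readers who want something concrete, here is a minimal sketch of the surprise-driven update idea mentioned above. It is a toy, not the paper's implementation: the memory is assumed to be a plain linear map, "surprise" is assumed to be the squared prediction error, and the names `read`, `surprise_update`, `lr`, and `forget` are made up for illustration.

```python
def read(memory, key):
    """Read from the memory: multiply the matrix (list of rows) by the key."""
    return [sum(w * k for w, k in zip(row, key)) for row in memory]


def surprise_update(memory, key, value, lr=0.5, forget=0.0):
    """Take one gradient step on 0.5 * ||memory @ key - value||^2.

    Returns the new memory and the squared prediction error, which
    stands in for "surprise" in this toy (an assumption, see above).
    """
    error = [p - v for p, v in zip(read(memory, key), value)]
    surprise = sum(e * e for e in error)
    new_memory = [
        # (1 - forget) decays old content; lr * e * k is the gradient step.
        [(1.0 - forget) * w - lr * e * k for w, k in zip(row, key)]
        for row, e in zip(memory, error)
    ]
    return new_memory, surprise


memory = [[0.0, 0.0], [0.0, 0.0]]
key, value = [1.0, 0.0], [2.0, -1.0]
surprises = []
for _ in range(5):
    memory, s = surprise_update(memory, key, value)
    surprises.append(s)
```

Showing the same key/value pair repeatedly makes the surprise shrink on every step, so the size of the write shrinks with it: data the memory already predicts well barely changes it.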
There's nothing "pretty conventional" about a neural memory mechanism that comes along with such solid evidence of scalability and appealing performance characteristics.
If neural memory were conventional, GPT4o's memory wouldn't be stored as plain text and prepended to prompts.
This paper reminds me of the Switch Transformer paper, in that it solidifies, expands on, and proves out an area of research that may well have a big impact on leading LLMs and the SOTA in AI.
Agreed, the concept of surprise is very cool.
>the concept of surprise is very cool
Then you may be interested in Simplicity Theory:
https://simplicitytheory.telecom-paris.fr/
In particular this recent paper:
>Unexpectedness and Bayes’ Rule
>A great number of methods and of accounts of rationality consider at their foundations some form of Bayesian inference. Yet, Bayes’ rule, because it relies upon probability theory, requires specific axioms to hold (e.g. a measurable space of events). This short document hypothesizes that Bayes’ rule can be seen as a specific instance of a more general inferential template, that can be expressed also in terms of algorithmic complexities, namely through the measure of unexpectedness proposed by Simplicity Theory.
Source: https://cifma.github.io/Papers-2021/CIFMA_2021_paper_13.pdf
It's hard to take it seriously when every single paper on the subject is from one guy.
There definitely is precedent - any parallelizably-decodable CABAC-derived neural compression algorithm basically has a flavor of this idea at its heart - intersperse statistical state throughout your token stream so you can decouple novelty in your state space on the fly.
Taken to its extreme where the ‘memory’ is descriptive enough to deterministically control the decoding you get parallelism over the sequence for free as a consequence of the associativity.
Similar techniques are used in making video compression algorithms robust enough for low latency reconnection in online streaming in poor/changing network conditions, or making it possible to decompress JPEGs at >1GBps in parallel by exploiting the presence of ‘RESET’ tokens that indicate independent/novel substreams.
That said, I do agree that this is definitely a great paper and contribution to language models!
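To illustrate the reset-marker point above, here is a toy sketch under loud assumptions: the stream is a simple delta code, `RESET` is a hypothetical marker (not the JPEG restart-marker byte format), and all function names are invented for the example. Each marker is followed by an absolute value, so every segment can be decoded without looking at the ones before it.

```python
from concurrent.futures import ThreadPoolExecutor

RESET = "RESET"  # hypothetical marker for this toy, not a real codec token


def encode(values, interval):
    """Delta-encode integers, restarting from an absolute value every `interval` items."""
    stream = []
    for i, v in enumerate(values):
        if i % interval == 0:
            stream += [RESET, v]              # absolute value: no dependence on the past
        else:
            stream.append(v - values[i - 1])  # delta against the previous value
    return stream


def split_segments(stream):
    """Split the stream into independent segments, one per RESET marker."""
    segments = []
    for token in stream:
        if token == RESET:
            segments.append([])
        else:
            segments[-1].append(token)
    return segments


def decode_segment(segment):
    """Decode one segment: the first token is absolute, the rest are deltas."""
    out = [segment[0]]
    for delta in segment[1:]:
        out.append(out[-1] + delta)
    return out


def decode_parallel(stream):
    """Decode every segment independently, then concatenate in order."""
    with ThreadPoolExecutor() as pool:
        decoded = list(pool.map(decode_segment, split_segments(stream)))
    return [v for seg in decoded for v in seg]


values = [3, 5, 4, 4, 9, 10, 7]
stream = encode(values, interval=3)
restored = decode_parallel(stream)
```

Threads in CPython will not make this faster; the point is only structural. Without the markers, decoding item N needs items 0..N-1, and with them the segments have no data dependence on each other.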
1991
> Each RNN tries to solve the pretext task of predicting its next input, sending only unexpected inputs to the next RNN above. This greatly facilitates downstream supervised deep learning such as sequence classification. By 1993, the approach solved problems of depth 1000 (requiring 1000 subsequent computational stages/layers—the more such stages, the deeper the learning). A variant collapses the hierarchy into a single deep net. It uses a so-called conscious chunker RNN which attends to unexpected events that surprise a lower-level so-called subconscious automatiser RNN. The chunker learns to understand the surprising events by predicting them. The automatiser uses my neural knowledge distillation procedure of 1991 [UN0-UN2] to compress and absorb the formerly conscious insights and behaviours of the chunker, thus making them subconscious. The systems of 1991 allowed for much deeper learning than previous methods.
https://people.idsia.ch/~juergen/very-deep-learning-1991.htm...
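A toy sketch of the "send only unexpected inputs upward" idea from the quote above. This is an assumption-heavy stand-in, not the 1991 system: a lookup table replaces the lower-level RNN, and the function name `unexpected_events` is made up for illustration.

```python
def unexpected_events(sequence):
    """Return (index, symbol) pairs the lower-level predictor failed to predict."""
    next_after = {}   # learned table: previous symbol -> predicted next symbol
    passed_up = []
    previous = None
    for i, symbol in enumerate(sequence):
        if next_after.get(previous) != symbol:
            passed_up.append((i, symbol))   # surprise: send to the higher level
        next_after[previous] = symbol       # update the prediction for next time
        previous = symbol
    return passed_up


events = unexpected_events("abcabcabcxabc")
# events == [(0, 'a'), (1, 'b'), (2, 'c'), (3, 'a'), (9, 'x'), (10, 'a')]
```

Once the repeating "abc" pattern is learned, the higher level only sees the first pass through it and the disruption around "x": 6 of the 13 symbols in this example.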
It's unfortunate that Schmidhuber, who has made many seminal contributions to the field, also engages in "retroactive flag planting" whereby he claims credit for any current successes that are remotely related to anything he has worked on, even if only in terms of hand-wavy problem approach rather than actually building upon his own work.
It's obvious that things like memory, on various timescales (incl. working), selective attention, surprise (i.e. prediction failure) as a learning/memorization signal are going to be part of any AGI solution, but the question is how do you combine and realize these functionalities into an actual cognitive architecture?
Schmidhuber (or in this case you, on his behalf!) effectively saying "I worked on that problem, years ago" is irrelevant. He also worked on LSTMs, which learned to memorize and forget, and the reference section of the "Titans" paper leads to many more recent attempts - different proposed architectures - addressing the same problems around (broadly speaking) learning how best to use working memory. Lots of people are suggesting alternatives, but it would seem no compelling solution has been published.
If it's one of the commercial frontier model labs that does discover the next piece of the architectural puzzle in moving beyond transformers towards AGI, I very much doubt they'll be in any hurry to publish it!