I think this is an attempt to enrich the locality model in transformers.
One of the weird things you do in transformers is add a position vector, which captures the distance between the token being attended to and some other token.
This is obviously not powerful enough to express non-linear relationships, like graph relationships.
This person seems to be experimenting with pre-processing the input token set, linearly reordering it by some other heuristic that might map more closely to the actual underlying relationships between tokens.
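For anyone who hasn't seen the "position vector" part spelled out, here is a minimal sketch. It assumes the fixed sinusoidal scheme from the original Transformer paper; the comment above doesn't say which scheme it has in mind, and the function names are made up for illustration.

```python
import math

def sinusoidal_position(pos, d_model):
    """Fixed sinusoidal position vector for one token position (assumed scheme)."""
    vec = []
    for i in range(0, d_model, 2):
        # Each sin/cos pair uses a different wavelength.
        angle = pos / (10000 ** (i / d_model))
        vec.append(math.sin(angle))
        vec.append(math.cos(angle))
    return vec[:d_model]

def add_positions(embeddings):
    """Add a position vector to each token embedding (list of equal-length lists)."""
    d_model = len(embeddings[0])
    return [
        [e + p for e, p in zip(emb, sinusoidal_position(pos, d_model))]
        for pos, emb in enumerate(embeddings)
    ]
```

The point is that the only structure injected here is the index along a line, which is why anything graph-shaped has to be learned on top of it.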
Once upon a time, when I was a language modeling researcher, I built and fine-tuned a big (for the time: about 5 billion parameters) Sparse Non-Negative Matrix Language Model [1].
[1] https://aclanthology.org/Q16-1024/
As this model allows mixing and matching various contexts, one thing I did was add a word-sorted context. This effectively transforms a position-based context into a word-set-based context, where "you and me", "me and you" and "and me you" are the same.
This allowed for longer contexts and better prediction.
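A minimal sketch of the word-sorted context idea, just to make it concrete. This is not the SNM implementation from [1]; the function names are hypothetical, and it only shows how sorting the context window collapses different word orders into one feature key.

```python
def positional_context(tokens, n):
    """Ordinary n-gram style context: the last n tokens, in order."""
    return tuple(tokens[max(0, len(tokens) - n):])

def word_sorted_context(tokens, n):
    """Order-insensitive context: the last n tokens, sorted alphabetically."""
    return tuple(sorted(tokens[max(0, len(tokens) - n):]))
```

With this, `word_sorted_context("you and me".split(), 3)` and `word_sorted_context("me and you".split(), 3)` both give `("and", "me", "you")`, while the positional version keeps them distinct. Fewer distinct keys means more counts per key, which is presumably what made longer contexts workable.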
Adding the position vector is basic, sure, but it's naive to think the model doesn't develop its own positional system bootstrapping on top of the barebones one.
For some reason people are still adding position encodings into embeddings.
As if they are not relying on the model's ability to develop its own "positional system bootstrapping on top of the barebones one."
it makes sense architecturally
they replace dot-product attention with topology-based scalar distances derived from a Laplacian embedding - that effectively reduces attention scoring to a 1D energy comparison, which can save memory and compute
that said, I'd treat the results with a grain of salt given there is no peer review, and benchmarks are only on a 30M-parameter model so far
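To illustrate what a "1D energy comparison" could look like, here is a rough sketch. It is not the paper's method: the scoring rule (negative absolute difference), the `temperature` parameter and the function name are all assumptions made for illustration. Each token carries one scalar instead of a key vector, and attention weights come from how close those scalars are.

```python
import math

def scalar_attention_weights(energies, temperature=1.0):
    """Softmax attention weights from pairwise distances between per-token scalars.

    energies: one float per token (assumed to come from some 1D embedding).
    temperature: assumed positive scaling factor; smaller means sharper attention.
    """
    weights = []
    for e_q in energies:
        # Closer scalars get higher (less negative) scores.
        scores = [-abs(e_q - e_k) / temperature for e_k in energies]
        m = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        weights.append([x / total for x in exps])
    return weights
```

In this sketch the per-token state used for scoring is a single float rather than a d-dimensional key, which is where a memory saving would come from if the idea holds up.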
Yup, keyword here is “under the right conditions”.
This may work well for their use case but fail horribly in others without further peer review and testing.
I haven’t read the paper yet, but the graph laplacian is quite useful in reordering matrices, so it isn’t that surprising if they managed to get something out of it in ML.
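For context, the classic version of that reordering trick is spectral ordering: sort the nodes by their entry in the Fiedler vector (the eigenvector for the second-smallest eigenvalue of the graph Laplacian). Below is a small pure-Python sketch using shifted power iteration. It assumes a connected, undirected graph, the helper names are made up, and it says nothing about what the paper actually does.

```python
import math

def laplacian(n, edges):
    """Dense graph Laplacian L = D - A for an undirected graph on n nodes."""
    L = [[0.0] * n for _ in range(n)]
    for i, j in edges:
        L[i][i] += 1.0
        L[j][j] += 1.0
        L[i][j] -= 1.0
        L[j][i] -= 1.0
    return L

def fiedler_vector(L, iters=500):
    """Approximate the Fiedler vector by power iteration on (shift*I - L)."""
    n = len(L)
    # Gershgorin bound: every eigenvalue of L is at most 2 * max degree.
    shift = 2.0 * max(L[i][i] for i in range(n))
    v = [float(i) for i in range(n)]  # deterministic starting vector
    for _ in range(iters):
        mean = sum(v) / n
        v = [x - mean for x in v]  # project out the constant eigenvector
        w = [shift * v[i] - sum(L[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

def spectral_order(n, edges):
    """Node permutation obtained by sorting nodes on their Fiedler vector entry."""
    f = fiedler_vector(laplacian(n, edges))
    return sorted(range(n), key=lambda i: f[i])
```

On a path graph whose nodes have been labeled in scrambled order, e.g. edges `[(3, 0), (0, 4), (4, 1), (1, 2)]`, this recovers the path order (possibly reversed), which is the same effect that pulls a sparse matrix's nonzeros toward the diagonal.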
No, it's a new form of alchemy that turns electricity into hype. The technical jargon is more of a thieves' cant to help identify other conmen to one another
that's a strange way to spell "no, I didn't understand the paper"
Perhaps someone who does understand the paper will kindly make it a bit clearer for those of us who get a bit lost.
Try to get over your AI hate.
If you need help getting more out of AI, you can use ChatGPT and co. to go through papers and have them ELI5 paragraphs for you. 3Blue1Brown also has a few great videos about transformers and how they work
Ideologues usually aren't great at primary source understanding/reasoning, hence why they end up with such strong opinions.