Comment by kannanvijayan

1 month ago

I think this is an attempt to enrich the locality model in transformers.

One of the weird things you do in transformers is add a position vector which captures the distance between the token being attended to and some other token.

This is obviously not powerful enough to express non-linear relationships - like graph relationships.
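To make the "position vector" point concrete, here is a minimal sketch assuming the classic sinusoidal encoding (one common choice, not necessarily what the linked work uses). The function names, `dim=16` and `base=10000.0` are illustrative assumptions. It shows that the similarity between two position vectors depends only on how far apart the positions are along a line, which is why this scheme has no natural way to say "these two tokens are neighbours in a graph".

```python
import math

def positional_encoding(pos, dim=16, base=10000.0):
    """Sinusoidal position vector for a single position (dim must be even)."""
    vec = []
    for i in range(dim // 2):
        freq = base ** (-2 * i / dim)  # one frequency per sin/cos pair
        vec.append(math.sin(pos * freq))
        vec.append(math.cos(pos * freq))
    return vec

def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

# Positions 3 and 7 are 4 apart; so are positions 10 and 14.
near = dot(positional_encoding(3), positional_encoding(7))
shifted = dot(positional_encoding(10), positional_encoding(14))

# The two similarities agree up to floating-point error, because each
# sin/cos pair contributes cos((p - q) * freq), a function of the offset only.
same_offset_same_similarity = abs(near - shifted) < 1e-9
```

The only relationship encoded here is the one-dimensional offset `p - q`; anything richer has to be learned on top of it or supplied some other way, for example by reordering the input.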

This person seems to be experimenting with doing pre-processing of the input token set, to linearly reorder it by some other heuristic that might map more closely to the actual underlying relationship between each token.


thesz 1 month ago

  > like graph relationships

Back when I was a language modeling researcher, I built and fine-tuned a big (for the time: about 5 billion parameters) Sparse Non-Negative Matrix Language Model [1].

[1] https://aclanthology.org/Q16-1024/

As this model allows mixing and matching of various contexts, one thing I did was add a word-sorted context. This effectively transforms a position-based context into a word-set-based context, where "you and me", "me and you" and "and me you" are the same.

This allowed for longer contexts and better prediction.
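A minimal sketch of the word-sorted context idea, assuming contexts are just tuples of the last `n` words. This is not the SNMLM feature extractor; `positional_context` and `word_sorted_context` are hypothetical names made up for illustration.

```python
def positional_context(history, n=3):
    """Context key that keeps word order: the last n words as-is."""
    return tuple(history[-n:])

def word_sorted_context(history, n=3):
    """Context key that discards word order: the last n words, sorted."""
    return tuple(sorted(history[-n:]))

phrases = ["you and me", "me and you", "and me you"]

# With order preserved, every permutation is a distinct context.
positional_keys = {positional_context(p.split()) for p in phrases}

# With sorting, all three permutations collapse into one context,
# so their statistics are pooled.
sorted_keys = {word_sorted_context(p.split()) for p in phrases}
```

Because many orderings share one key, a longer window produces far fewer distinct contexts than the positional version would, which is one plausible reading of why longer contexts became practical.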

nickpsecurity 1 month ago

I've saved it to look at later. I also remembered the name of Kristina Toutanova (your editor). Looking up her recent publications, she's done interesting work on analyzing pretraining mixtures.

https://aclanthology.org/2025.acl-long.1564/

Thanks to you both for two interesting papers tonight. :)

thesz 1 month ago

I am not an author of the SNMLM paper. ;)
I was using their model in my work.
  

adroniser 1 month ago

Adding the position vector is basic, sure, but it's naive to think the model doesn't develop its own positional system, bootstrapping on top of the barebones one.

thesz 1 month ago

For some reason people are still adding position encodings into embeddings.
As if they are not relying on the model's ability to develop its own "positional system bootstrapping on top of the barebones one."

tuned 1 month ago

> This is obviously not powerful enough to express non-linear relationships - like graph relationships.

The distance metric used is based on energy-informed graphs that encode energy relations in a distribution called taumode; see my previous paper on spectral indexing for vector databases for a complete roll-out.