Comment by kouteiheika
9 hours ago
If you want to prove a new alternative to attention (i.e. show that it works and/or is faster in a real-world scenario) without breaking the bank, then one of the best ways to do that would probably be to retrain an already existing model, just with the attention modules swapped out. Then, once you have such a model, you can do apples-to-apples benchmarks.
This has been done successfully in the past:
https://huggingface.co/featherless-ai/QRWKV-72B
Note that this is a 72B model which would be very expensive to train from scratch, but here they did the conversion for less than $2000.
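To make the "swap the attention modules and retrain" idea concrete, here's a minimal sketch in plain PyTorch. This is not the actual QRWKV recipe; `TinyLM`, `Block`, `CausalSelfAttention` and `LinearAttention` are all made-up stand-ins for illustration. The shape of the procedure: clone the pretrained teacher, replace each attention module with the new sequence mixer (same input/output shape), freeze everything else, and distill against the teacher's outputs.

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalSelfAttention(nn.Module):
    """Stand-in for the original softmax-attention module."""
    def __init__(self, dim, heads):
        super().__init__()
        self.heads = heads
        self.qkv = nn.Linear(dim, 3 * dim, bias=False)
        self.proj = nn.Linear(dim, dim, bias=False)

    def forward(self, x):
        b, t, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k, v = (y.view(b, t, self.heads, d // self.heads).transpose(1, 2) for y in (q, k, v))
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.proj(out.transpose(1, 2).reshape(b, t, d))

class LinearAttention(nn.Module):
    """Hypothetical drop-in replacement: causal linear attention with an
    elu+1 feature map. Same input/output shape as the module it replaces."""
    def __init__(self, dim, heads):
        super().__init__()
        self.heads = heads
        self.qkv = nn.Linear(dim, 3 * dim, bias=False)
        self.proj = nn.Linear(dim, dim, bias=False)

    def forward(self, x):
        b, t, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k, v = (y.view(b, t, self.heads, d // self.heads).transpose(1, 2) for y in (q, k, v))
        q, k = F.elu(q) + 1, F.elu(k) + 1
        kv = torch.einsum("bhtd,bhte->bhtde", k, v).cumsum(dim=2)  # running sum of k^T v
        z = k.cumsum(dim=2)                                        # running normalizer
        num = torch.einsum("bhtd,bhtde->bhte", q, kv)
        den = torch.einsum("bhtd,bhtd->bht", q, z).unsqueeze(-1) + 1e-6
        return self.proj((num / den).transpose(1, 2).reshape(b, t, d))

class Block(nn.Module):
    def __init__(self, dim, heads):
        super().__init__()
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.attn = CausalSelfAttention(dim, heads)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x):
        x = x + self.attn(self.norm1(x))
        return x + self.mlp(self.norm2(x))

class TinyLM(nn.Module):
    def __init__(self, vocab=256, dim=128, heads=4, layers=4):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)
        self.blocks = nn.ModuleList(Block(dim, heads) for _ in range(layers))
        self.head = nn.Linear(dim, vocab, bias=False)

    def forward(self, idx):
        x = self.emb(idx)
        for blk in self.blocks:
            x = blk(x)
        return self.head(x)

# --- the conversion itself ---------------------------------------------------
teacher = TinyLM()                        # stand-in for the real pretrained model
student = copy.deepcopy(teacher)
for blk in student.blocks:                # swap only the attention modules
    blk.attn = LinearAttention(dim=128, heads=4)   # must match the teacher's dims
for name, p in student.named_parameters():
    p.requires_grad = "attn" in name      # train the new mixers, freeze the rest

opt = torch.optim.AdamW((p for p in student.parameters() if p.requires_grad), lr=3e-4)
tokens = torch.randint(0, 256, (8, 64))   # stand-in for real training data
with torch.no_grad():
    teacher_probs = F.softmax(teacher(tokens), dim=-1)
loss = F.kl_div(F.log_softmax(student(tokens), dim=-1), teacher_probs,
                reduction="batchmean")
loss.backward()
opt.step()
```

A real conversion would of course use real data at scale and typically stage the distillation (e.g. matching per-layer hidden states before moving to logit-level KL), but the overall structure of "keep the pretrained weights, only retrain the swapped-in mixers" is the same.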
I'd say try the nanoGPT speedrun. It's much easier to train and gives you a better comparison against already-optimized baselines.
https://github.com/KellerJordan/modded-nanogpt
Labs were also competing to train BERTs for $20 or less. People still use them a lot, too.
https://www.databricks.com/blog/mosaicbert
I'll add that they should do a number of small training runs with different architectures and data mixes. That would help demonstrate generalization.
This is interesting. Has there been more research into this architecture? I hear about it once every few years, but it always seems like a niche/experimental thing. Based on the graph in their blog post, though, you'd expect every company to be using it.
That doesn’t tell you whether the new method continues to perform better at higher parameter counts.
Nor whether training from scratch will even work.
Depending on how different the attention mechanism is, that might not work. If it’s just a faster / different way of finding the tokens to attend to, sure. But I get the sense the author is implying this method uses different semantics somehow. Although tbh I didn’t follow it entirely.