
Comment by oofbey

5 hours ago

The big bet with this technique is having a fixed (non-learned) matrix that converts the token latent space to the linear-attention space. So you can kinda cheat and say your model is small, because a bunch of the smarts live in this big fixed graph Laplacian matrix L.
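For concreteness, here's roughly the shape of the thing as I understand it - a minimal sketch, not the paper's actual code; the module name, the shapes, the ReLU feature map, and the non-causal einsums are all my assumptions:

```python
# Sketch only: a frozen graph-Laplacian-derived projection feeding linear attention.
# Everything here (names, shapes, kernel choice) is an assumption for illustration.
import torch
import torch.nn as nn

class FixedLaplacianLinearAttention(nn.Module):
    def __init__(self, laplacian: torch.Tensor, d_model: int):
        super().__init__()
        # L is precomputed from some token graph and never trained.
        self.register_buffer("L", laplacian)      # (d_model, d_feat), fixed
        self.w_v = nn.Linear(d_model, d_model)    # values stay learned

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model)
        q = x @ self.L                             # fixed map into the attention feature space
        k = x @ self.L
        v = self.w_v(x)
        phi_q, phi_k = q.relu(), k.relu()          # simple kernel feature map
        # Linear attention (non-causal for brevity): O(n) via a K^T V summary
        kv = torch.einsum("bnd,bne->bde", phi_k, v)
        z = 1.0 / (torch.einsum("bnd,bd->bn", phi_q, phi_k.sum(dim=1)) + 1e-6)
        return torch.einsum("bnd,bde,bn->bne", phi_q, kv, z)
```

The point being: the only learned pieces per layer are small, while L carries a lot of structure for free.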

So how do you scale this up from a toy problem? Well, that L would have to get bigger, and it's hard to imagine it being useful if L is not trained. Then it starts to look a lot more like a conventional transformer, but probably harder to train, with the benefit of smaller KV caches. (Half the size - not a massive win.)
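Back-of-envelope on the "half the size" bit, assuming the scheme only needs to cache one tensor per token (keys being implied by the fixed L) - the model dimensions here are made up:

```python
# Illustrative cache arithmetic, not measurements from the paper.
d_model, n_layers, seq_len, bytes_per = 4096, 32, 8192, 2  # fp16

standard_kv = 2 * d_model * n_layers * seq_len * bytes_per  # cache K and V
fixed_l_kv  = 1 * d_model * n_layers * seq_len * bytes_per  # cache V only

print(f"standard: {standard_kv / 2**30:.1f} GiB, fixed-L: {fixed_l_kv / 2**30:.1f} GiB")
# standard: 4.0 GiB, fixed-L: 2.0 GiB
```

A 2x saving is nice but it's not the 10-100x you'd want to justify a whole new architecture.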

So overall it doesn't seem to me like it's gonna amount to anything.