Comment by joshuamoyers

3 hours ago

I think the primary reason it works is because the difference between K and Q, which is not all that obvious is that it’s allowing the model to have an asymmetric relationship between tokens, so one token can attend to another without the reverse being true. It seems to me if you just have a single value that you’re representing symmetric relationship, which might degrade the quality of reasoning over a set of tokens, but also is probably possible.

1 comment

joshuamoyers

joshuamoyers 3 hours ago

it seems to be something that’s similar to the class of optimizations associated with with linear or state space attention when things models often do is once they figure out an optimization like this they create a ratio between full resolution blocks and blocks that have the optimization implemented.