Comment by albertzeyer

3 years ago

Such equivalences can connect two different fields and allow methods to be transferred from one field to the other. Each field has usually developed quite a number of methods and tricks over time. So when this work shows that the two are equivalent (with restrictions), you can maybe take some of the tricks from SVMs and use them to improve the Transformer model or its training.

Otherwise, such equivalences simply help us to better understand Transformers and SVMs.

There have been similar equivalences before, for example:

Linear Transformers Are Secretly Fast Weight Programmers, https://arxiv.org/abs/2102.11174
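The core of that equivalence fits in a few lines of NumPy: softmax-free (linear) causal attention produces exactly the same outputs as additively "programming" a fast weight matrix with outer products v_t k_t^T and applying it to the query. This is a minimal sketch of the unnormalized case only; the paper's full formulation also covers normalization and feature maps.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 5, 4
K = rng.standard_normal((T, d))  # keys
V = rng.standard_normal((T, d))  # values
Q = rng.standard_normal((T, d))  # queries

# View 1: linear (softmax-free) causal attention.
# Output at step t is a sum over past values, weighted by k_s . q_t.
attn_out = np.stack([
    sum(V[s] * (K[s] @ Q[t]) for s in range(t + 1))
    for t in range(T)
])

# View 2: fast weight programmer.
# Additively update a weight matrix with outer products v_t k_t^T,
# then apply the current weights to the query q_t.
W = np.zeros((d, d))
fwp_out = []
for t in range(T):
    W += np.outer(V[t], K[t])
    fwp_out.append(W @ Q[t])
fwp_out = np.stack(fwp_out)

# Both views compute the same thing.
assert np.allclose(attn_out, fwp_out)
```

The fast-weight view makes the recurrent, constant-memory nature of linear attention explicit, which is exactly the kind of insight such an equivalence buys you.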

Or: policy gradient methods from reinforcement learning are basically the same as sequence-discriminative training, as it has been done in speech recognition for many years. However, the two come with different tricks, and combining those tricks turned out to be helpful.