Comment by donnietb
2 days ago
I think they already tried it in the original Transformer paper. The results didn't make it worth using.
From the paper (where additive attention is the other "similarity function"):
Additive attention computes the compatibility function using a feed-forward network with a single hidden layer. While the two are similar in theoretical complexity, dot-product attention is much faster and more space-efficient in practice, since it can be implemented using highly optimized matrix multiplication code.
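For anyone wanting to see the difference concretely, here's a minimal NumPy sketch of the two compatibility functions the quote contrasts. It assumes unbatched, single-head inputs, and the parameter names (W_q, W_k, v, d_hidden) are my own for illustration, not from the paper:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def dot_product_attention(Q, K, V):
    """Scaled dot-product attention (the Transformer's choice).

    Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v).
    The scores are a single matmul, so this maps directly onto
    highly optimized GEMM kernels.
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)        # (n_q, n_k)
    return softmax(scores) @ V             # (n_q, d_v)

def additive_attention(Q, K, V, W_q, W_k, v):
    """Additive (Bahdanau-style) attention.

    The compatibility function is a one-hidden-layer feed-forward net:
        score(q, k) = v^T tanh(W_q q + W_k k)
    W_q, W_k: (d_hidden, d_model), v: (d_hidden,).
    Scoring materializes an (n_q, n_k, d_hidden) tensor, which is why
    it is slower and less memory-friendly in practice.
    """
    proj_q = Q @ W_q.T                     # (n_q, d_hidden)
    proj_k = K @ W_k.T                     # (n_k, d_hidden)
    hidden = np.tanh(proj_q[:, None, :] + proj_k[None, :, :])  # (n_q, n_k, d_hidden)
    scores = hidden @ v                    # (n_q, n_k)
    return softmax(scores) @ V             # (n_q, d_v)

# Toy shapes just to show both run on the same inputs.
rng = np.random.default_rng(0)
n_q, n_k, d_model, d_hidden = 4, 6, 8, 16
Q, K, V = rng.normal(size=(n_q, d_model)), rng.normal(size=(n_k, d_model)), rng.normal(size=(n_k, d_model))
W_q, W_k, v = rng.normal(size=(d_hidden, d_model)), rng.normal(size=(d_hidden, d_model)), rng.normal(size=(d_hidden,))
print(dot_product_attention(Q, K, V).shape)          # (4, 8)
print(additive_attention(Q, K, V, W_q, W_k, v).shape)  # (4, 8)
```

The point the paper is making shows up right there: the dot-product version is one matrix multiply per score matrix, while the additive version has to broadcast and tanh an (n_q, n_k, d_hidden) tensor before reducing it.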