Comment by sametoymak
3 years ago
This SVM summarizes the training dynamics of the attention layer, so there is no hidden layer. It operates on the token embeddings fed into that layer. Essentially, the weights of the attention layer converge (in direction) to the maximum-margin separator between the good vs bad tokens. Note that there are no labels involved; instead you are separating the tokens based on their contribution to the training loss. We can formally assign a "score" to each token for a 1-layer model, but this is tricky to do for multilayer models with MLP heads.
Finally, I agree that this is more step-function like. There are caveats we discuss in the paper (e.g., how the transformer assigns continuous softmax probabilities over the selected tokens).
To me, the summary is: through softmax attention, the transformer is running a "feature/token selection procedure". Thanks to the softmax, we can obtain a clean SVM interpretation of max-margin token separation.
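The 1-layer picture above can be sketched numerically. This is a made-up toy, not the paper's actual setup: the token embeddings, the ±1 "values" marking good vs bad tokens, and the loss (just the negated attention output) are all illustrative assumptions. Plain gradient descent on the attention weights ends up concentrating the softmax mass on the good token, and the learned weight direction linearly separates it from the rest:

```python
import numpy as np

# Hypothetical toy (not from the paper): one "good" token (value +1),
# three "bad" tokens (value -1), a single attention query, and
# loss = -output, so lowering the loss means attending to the good token.
X = np.array([[2.0, 0.0],    # good token embedding
              [0.0, 2.0],    # bad
              [-1.0, -1.0],  # bad
              [1.0, 1.5]])   # bad
v = np.array([1.0, -1.0, -1.0, -1.0])

w = np.zeros(2)  # query/key weights collapsed into one vector
lr = 0.5
for _ in range(2000):
    s = X @ w                              # attention logits
    p = np.exp(s - s.max()); p /= p.sum()  # softmax attention weights
    out = p @ v                            # attention output
    # gradient of (-out) wrt w via the softmax Jacobian: -X^T (p * (v - out))
    w -= lr * -(X.T @ (p * (v - out)))

p = np.exp(X @ w - (X @ w).max()); p /= p.sum()
margins = X @ (w / np.linalg.norm(w))
print("attention weights:", p.round(3))
print("good token separated:", margins[0] > margins[1:].max())
```

The softmax mass ends up almost entirely on token 0, and along the normalized direction of `w` the good token's score strictly exceeds every bad token's, which is the (hedged, toy-scale) analogue of the max-margin separation claim; the actual theorem statement and its conditions are in the paper.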