Comment by ogogmad

3 years ago

When you say SVM, do you mean any classifier that finds a separating hyperplane, like a no-hidden-layer "perceptron" or Naive Bayes, rather than one that finds the maximum-margin hyperplane? Or is finding the maximum margin important here? Thanks. Very interesting.

I think our own brains and nervous system use a step function as their "activation function", so this could, optimistically, be a throwback to the roots of Rosenblatt's idea.

This SVM summarizes the training dynamics of the attention layer, so there is no hidden layer. It operates on the token embeddings of that layer. Essentially, the weights of the attention layer converge (in direction) to the maximum-margin separator between the good and bad tokens. Note that no labels are involved; instead, the tokens are separated based on their contribution to the training loss. We can formally assign a "score" to each token for a 1-layer model, but this is tricky to do for multilayer models with MLP heads.
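To make the "converges in direction to the maximum-margin separator" claim concrete, here is a toy numpy sketch. The embeddings and the good/bad split are invented for illustration (in the paper the split comes implicitly from each token's effect on the loss, not from explicit labels), and the max-margin separator is found by plain subgradient descent on a regularized hinge loss, not by the paper's analysis:

```python
import numpy as np

# Hypothetical 2-D "token embeddings": one cluster stands in for tokens
# that reduce the training loss, the other for tokens that increase it.
rng = np.random.default_rng(0)
good = rng.normal(loc=+2.0, scale=0.3, size=(20, 2))
bad = rng.normal(loc=-2.0, scale=0.3, size=(20, 2))
X = np.vstack([good, bad])
y = np.array([1] * 20 + [-1] * 20)  # illustrative labels, not data labels

# Approximate the max-margin separator via subgradient descent on the
# soft-margin SVM objective (small lam ~ hard margin on separable data).
w, b = np.zeros(2), 0.0
lr, lam = 0.1, 1e-3
for _ in range(5000):
    margins = y * (X @ w + b)
    viol = margins < 1  # tokens inside the margin
    gw = lam * w - (y[viol, None] * X[viol]).sum(axis=0) / len(X)
    gb = -y[viol].sum() / len(X)
    w -= lr * gw
    b -= lr * gb

# Only the *direction* matters for the convergence statement, so normalize.
w_dir = w / np.linalg.norm(w)
print("separator direction:", np.round(w_dir, 3))
```

The picture to have in mind is that the attention weights grow over training while their direction settles on `w_dir`, the separator that maximizes the margin between the two token groups.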

Finally, I agree that this is more step-function-like. There are caveats we discuss in the paper (e.g. how the transformer assigns continuous softmax probabilities over the selected tokens).

To me, the summary is: through softmax attention, the transformer is running a "feature/token selection procedure". Thanks to the softmax, we obtain a clean SVM interpretation of max-margin token separation.
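The softmax-as-selection point, and the step-function caveat above, can be sketched in a few lines. The token scores below are invented; the point is only that as the attention weights grow along a fixed direction (which "converges in direction" permits), the softmax over the scores hardens toward a step function that picks out the selected tokens while still assigning continuous probabilities at any finite scale:

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax.
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# Hypothetical scores <w, x_i> of four tokens under an attention direction w.
scores = np.array([2.0, 1.0, -1.0, -2.0])

# Scaling the weights along the fixed direction sharpens the selection.
for scale in (1.0, 5.0, 50.0):
    print(scale, np.round(softmax(scale * scores), 3))
```

At small scales the attention spreads probability over all tokens; at large scales essentially all mass lands on the top-scoring token, which is the step-function-like behavior discussed above.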