Comment by sametoymak
3 years ago
I am one of the authors. The most critical aspect is that the transformer is a "different kind of SVM": it solves an SVM that separates 'good' tokens from 'bad' tokens within each input sequence. This SVM acts as a good-token selector and is inherently different from the traditional SVM, which assigns a 0-1 label to inputs.
This also explains how attention induces sparsity through softmax: 'bad' tokens that fall on the wrong side of the SVM decision boundary are suppressed by the softmax function, while 'good' tokens are those that end up with non-zero softmax probabilities. It is also worth noting that this SVM arises from the exponential nature of the softmax.
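The suppression mechanism can be sketched numerically. The scores below are made up for illustration; in the paper they would come from the learned attention weights acting on token embeddings. Tokens with large negative scores (the "wrong side" of the boundary) receive near-zero softmax probability:

```python
import numpy as np

# Hypothetical attention scores for 5 tokens in one sequence.
# Large negative scores stand in for tokens on the wrong side
# of the (illustrative) SVM decision boundary.
scores = np.array([4.0, 3.5, -6.0, -8.0, 0.5])

# The exponential in softmax drives those tokens' probabilities
# toward zero, effectively suppressing them.
probs = np.exp(scores) / np.exp(scores).sum()
print(np.round(probs, 4))
```

The two tokens with negative scores end up with probabilities below 10^-3, while the positive-score tokens carry essentially all of the attention mass.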
The title of the paper does not make this clear, but hopefully the abstract does :).
Well, guess what, transformer is also a "traditional" SVM that assigns a 0-1 label: https://openreview.net/forum?id=U_T8-5hClV
It is interesting that you have cited this paper but did not correctly acknowledge its contribution. Yeah, I get the whole "they are doing X and we are doing X+1" narrative, but defining "good" tokens by multiplying Y_i into your head function is not much different from assigning a 0-1 label to inputs in a traditional SVM. Your "Y_i" essentially serves as a 0-1 label.
Sounds like a mind game of re-branding existing concepts lol.
Any comment on how the paper relates to "Every Model Learned by Gradient Descent Is Approximately a Kernel Machine" by Pedro Domingos?
https://arxiv.org/pdf/2308.16898.pdf
This seems related to NTK literature i.e. wide neural nets behave like kernel regression. NTK is a great tool but a notable weakness is kernel view doesn't explain how the model learns new features. Transformer is also pretty different from standard neural architectures because tokens interact with each other through attention. Our goal was capturing this interaction and we believe there is a clean insight on feature learning: Attention is running a token-selection procedure by implementing an SVM that separates tokens.
See our re-examination of the kernel equivalence. Path kernels exactly measure how models learn as their understanding of the data improves during training, and this can be expressed in terms of the gradients with regard to each training input: https://arxiv.org/abs/2308.00824
We believe that all neural networks are effectively an SVM, or more generally a reproducing-kernel architecture, that implicitly layers the understanding contributed during each training iteration. Do you have any comment on transformers in the RKHS or RKBS context?
When you say SVM, do you mean any classifier that finds a separating hyperplane, like a no-hidden-layer "perceptron" or Naive Bayes, instead of one which finds the maximum margin hyperplane? Or is finding the maximum margin important here? Thanks. Very interesting.
I think our own brains and nervous system use a step-function as their "activation function", so this could - optimistically - be a throwback to the roots of Rosenblatt's idea.
This SVM summarizes the training dynamics of the attention layer, so there is no hidden layer. It operates on the token embeddings of that layer. Essentially, the weights of the attention layer converge (in direction) to the maximum-margin separator between the good and bad tokens. Note that no label is involved; instead, you are separating the tokens based on their contribution to the training loss. We can formally assign a "score" to each token for the 1-layer model, but this is tricky to do for multilayer models with MLP heads.
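As a rough illustration of "maximum-margin separator between good vs bad tokens" (not the paper's actual construction): below, synthetic 2-D token embeddings stand in for the embeddings of one sequence, the good/bad split is hand-assigned, and sklearn's linear SVM with large C approximates the hard-margin solution whose direction the attention weights would converge to.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Hypothetical 2-D embeddings of tokens from one sequence:
# "good" tokens (reduce the training loss) vs "bad" tokens.
good = rng.normal(loc=[2.0, 2.0], scale=0.3, size=(5, 2))
bad = rng.normal(loc=[-2.0, -2.0], scale=0.3, size=(5, 2))
X = np.vstack([good, bad])
y = np.array([1] * 5 + [0] * 5)  # stand-in "contribution" split

# Large C makes the soft-margin SVM behave like a hard-margin one.
svm = SVC(kernel="linear", C=1e6).fit(X, y)
w = svm.coef_[0] / np.linalg.norm(svm.coef_[0])
print("max-margin direction:", np.round(w, 3))
```

In the paper's setting there are no labels; the split emerges from training. Here it is supplied by hand purely to show what the max-margin direction looks like.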
Finally, I agree that this is more step-function like. There are caveats we discuss in the paper (i.e. how TF assigns continuous softmax probabilities over the selected tokens).
To me, the summary is: through softmax attention, the transformer is running a "feature/token selection procedure". Thanks to the softmax, we obtain a clean SVM interpretation of max-margin token separation.
> It solves an SVM that separates 'good' tokens within each input sequence from 'bad' tokens. This SVM serves as a good-token-selector and is inherently different from the traditional SVM which assigns a 0-1 label to inputs.
sorry but how is separating 'good' tokens from 'bad' tokens inherently different from assigning a 0-1 label
Here is what I meant:
Standard SVM classifier: maps an input sequence to a 0-1 label. Example: take a paragraph and return its sentiment. During training, the label is specified.
Transformer's SVM: takes an input sequence, suppresses the bad tokens, and passes the good tokens to the next layer. This is a token selector rather than a classifier.
Example: take a paragraph and output the salient words in it. We don't know which words are salient during training; the model has to figure them out.
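A minimal sketch of the selector view (toy embeddings and a hand-picked query direction, purely illustrative): instead of emitting one label for the whole sequence, attention emits a weight per token, and a sharp softmax concentrates those weights on the "salient" tokens.

```python
import numpy as np

def softmax(z):
    z = z - z.max()  # numerical stability
    e = np.exp(z)
    return e / e.sum()

# Toy sequence: each row is a (hypothetical) token embedding.
tokens = np.array([[1.0, 0.0],   # "salient" word
                   [0.9, 0.1],   # "salient" word
                   [0.0, 1.0],   # filler
                   [0.1, 0.9]])  # filler

query = np.array([1.0, -1.0])             # illustrative learned direction
weights = softmax(10.0 * tokens @ query)  # sharp softmax ~ selection
output = weights @ tokens                 # weighted mix passed onward

print(np.round(weights, 3))
```

The filler tokens receive negligible weight, so `output` is effectively a combination of the salient tokens only: selection, not classification.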
AFAIR, SVMs have a single global optimum, reachable via convex optimization, whereas NNs can get stuck in local optima.
If a transformer is an SVM, could we simply extract it and optimize the hyperplane as for any SVM?
I have read that SVMs failed to take off as a machine learning model because of their inability to scale relative to deep neural networks. Would your work point to ways of changing this?
How is your paper different from all the ones like "transformers are really X", where X is the author's special field of study?
IMO it is important to understand transformer mechanics through core ML themes like SVMs and feature selection. Our results are not an interpretation; they are mathematically rigorous and numerically verifiable. That said, we have no intention of trivializing a complex model like GPT-4 into a simple SVM. That is a tall order :)
If there actually is an equivalence between these different types of systems and algorithms, that opens the door to simplification through unification.