Comment by gugagore

3 years ago

SVMs typically have weights per data point, i.e. they're nonparametric/hyperparametric. Modern machine learning doesn't really work like that anymore, right?

The weight-per-data-point thing is actually kind of orthogonal to the concept of an SVM, but the two are conflated by most introductions to SVMs. SVMs are linear models using hinge loss. In the "primal" optimization perspective (rather than the dual problem SVMs are usually formulated as), one optimizes the feature weights as normal. This is not sparse in general, but it's not like dual SVM weights are particularly sparse in practice either.
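
To make the primal view concrete, here's a minimal sketch of a linear SVM trained by subgradient descent on the regularized hinge loss (NumPy only; the toy data, learning rate, and regularization strength are all made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linearly separable data: two Gaussian blobs, labels in {-1, +1}.
X = np.vstack([rng.normal(-2, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
y = np.array([-1] * 50 + [1] * 50)

w, b = np.zeros(2), 0.0
lam, lr = 0.01, 0.1  # L2 strength and step size (arbitrary choices)
n = len(X)

# Subgradient descent on:
#   (1/n) * sum_i max(0, 1 - y_i (w.x_i + b)) + lam * ||w||^2
for _ in range(200):
    margins = y * (X @ w + b)
    active = margins < 1  # points currently inside or violating the margin
    grad_w = 2 * lam * w - (y[active, None] * X[active]).sum(axis=0) / n
    grad_b = -y[active].sum() / n
    w -= lr * grad_w
    b -= lr * grad_b

# Note that w is an ordinary dense feature-weight vector -- nothing
# "per data point" appears in this formulation at all.
accuracy = (np.sign(X @ w + b) == y).mean()
print(w, b, accuracy)
```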

  • Totally. Thank you for expanding on "typically".

    If I can expand on your "kind of": because of the kernel trick, it actually does matter that the data itself can determine the "linear" model (in an infinite-dimensional feature space, the primal formulation would require infinitely many parameters).

    • Kernelization can be done in the primal or the dual. By the representer theorem, it only ever needs as many parameters as data points. In the primal with a kernel K, you're just doing a feature expansion where each training point x contributes a feature whose value at a point y is K(x, y).
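
A sketch of that primal kernelization, with an RBF kernel and ridge-style least squares swapped in for brevity (all names and values here are made up): each training point becomes a feature column of the design matrix, so the model has exactly one parameter per data point.

```python
import numpy as np

rng = np.random.default_rng(1)

# 1-D regression toy: y = sin(x) + noise.
X = rng.uniform(-3, 3, 40)
y = np.sin(X) + 0.1 * rng.normal(size=40)

def rbf(a, b, gamma=1.0):
    # K(a, b) = exp(-gamma * (a - b)^2): each training point x becomes a
    # feature whose value at point y is K(x, y), as described above.
    return np.exp(-gamma * (a[:, None] - b[None, :]) ** 2)

# "Primal" view: the design matrix's columns are kernel features,
# one per training point -- so alpha has exactly n entries.
Phi = rbf(X, X)
lam = 1e-3  # small ridge penalty (arbitrary)
alpha = np.linalg.solve(Phi @ Phi.T + lam * np.eye(len(X)), Phi @ y)

def predict(x_new):
    return rbf(np.atleast_1d(x_new), X) @ alpha

print(float(predict(0.5)))  # should land near sin(0.5)
```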

Yes, SVMs don’t store weights like parametric models, but they also don’t store weights “per data point”. Only the points closest to the decision boundary are stored (i.e., the “support vectors”).
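
That sparsity is easy to see with scikit-learn's `SVC`, whose `support_` attribute lists exactly the retained points (assuming scikit-learn is installed; the toy data is made up):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(2)

# Two well-separated blobs: most points end up far from the boundary.
X = np.vstack([rng.normal(-2, 0.5, (100, 2)), rng.normal(2, 0.5, (100, 2))])
y = np.array([-1] * 100 + [1] * 100)

clf = SVC(kernel="linear", C=1.0).fit(X, y)

# Only the support vectors get nonzero dual coefficients; every other
# point could be deleted without changing the fitted boundary.
print(len(clf.support_), "support vectors out of", len(X), "points")
```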

The attention matrix is computed from all tokens in the context, so it kind of functions non-parametrically (but over the context window instead of over the whole training dataset).
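
The analogy can be made concrete: each attention output is a softmax-kernel-weighted average over the context, i.e. a Nadaraya-Watson-style smoother whose "weights" come from the data rather than from stored parameters (NumPy sketch, single head, made-up sizes):

```python
import numpy as np

rng = np.random.default_rng(3)
T, d = 6, 4  # context length and head dimension (arbitrary)

Q = rng.normal(size=(T, d))  # queries
K = rng.normal(size=(T, d))  # keys
V = rng.normal(size=(T, d))  # values

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# The (T, T) attention matrix has one entry per (query token, key token)
# pair -- it is recomputed from the context, not stored as a parameter.
A = softmax(Q @ K.T / np.sqrt(d))
out = A @ V  # each output row is a weighted average of the value rows

print(A.shape, np.allclose(A.sum(axis=1), 1.0))
```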