
Comment by mjhay

3 years ago

The weight-per-datapoint thing is actually kind of orthogonal to the concept of an SVM, though the two are conflated by most introductions to SVMs. SVMs are linear models with hinge loss. In the "primal" optimization perspective (rather than the dual problem SVMs are usually formulated as), one optimizes the feature weights as normal. This is not sparse in general, but then dual SVM weights aren't particularly sparse in practice either.
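To make the primal view concrete, here's a minimal sketch (my own illustration, not anything from the thread) of a linear SVM trained by subgradient descent on the regularized hinge loss. The learned vector `w` has one weight per feature, not per data point, and is dense in general:

```python
import numpy as np

def primal_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    """Linear SVM in the primal: minimize
    (lam/2)||w||^2 + (1/n) * sum_i max(0, 1 - y_i * w . x_i)
    by subgradient descent. Returns a dense weight vector, one per feature."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        margins = y * (X @ w)
        active = margins < 1                      # points violating the margin
        grad = lam * w - (y[active] @ X[active]) / n
        w -= lr * grad
    return w

# Toy linearly separable data: two Gaussian blobs
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
y = np.concatenate([-np.ones(50), np.ones(50)])
w = primal_svm(X, y)
acc = np.mean(np.sign(X @ w) == y)
```

The hyperparameters here are arbitrary; the point is just that nothing in the optimization refers to support vectors or per-datapoint weights.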

Totally. Thank you for expanding on "typically".

If I can expand on your "kind of": because of the kernel trick, it actually does matter that the data itself can determine the "linear" model, since a linear model in an infinite-dimensional feature space would require infinitely many parameters under the primal formulation.

  • Kernelization can be done in the primal or the dual. By the representer theorem, it only ever needs as many parameters as data points. In the primal with a kernel K, you're just doing a feature expansion where each data point x contributes a feature whose value at a point y is K(x, y).
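A sketch of that last point (again my own illustration, with arbitrary hyperparameters): treat K(·, x_i) for each training point x_i as a feature, so the "primal" model has exactly n parameters, one per data point, even though the implicit RBF feature space is infinite-dimensional:

```python
import numpy as np

def rbf_kernel(A, B, gamma=2.0):
    # K(a, b) = exp(-gamma * ||a - b||^2)
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * d2)

def kernel_primal_svm(X, y, lam=0.01, lr=0.1, epochs=500):
    """Kernelized 'primal' SVM: the feature vector of x is [K(x, x_j)]_j,
    so the model has one parameter alpha_j per training point, as the
    representer theorem says suffices. Regularizer is the RKHS norm
    (lam/2) * alpha^T K alpha, whose gradient is lam * K @ alpha."""
    n = len(X)
    K = rbf_kernel(X, X)          # row i is the feature expansion of x_i
    alpha = np.zeros(n)           # one parameter per data point
    for _ in range(epochs):
        margins = y * (K @ alpha)
        active = margins < 1
        grad = lam * (K @ alpha) - (y[active] @ K[active]) / n
        alpha -= lr * grad
    return alpha, K

# XOR-like labels that no linear model in the original space can fit
rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, (200, 2))
y = np.sign(X[:, 0] * X[:, 1])
alpha, K = kernel_primal_svm(X, y)
acc = np.mean(np.sign(K @ alpha) == y)
```

Note the optimization is ordinary subgradient descent on a "linear" model, just in the expanded K(·, x_i) features; the dual never appears, and `alpha` is not sparse in general.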