Comment by steppi
3 years ago
The article you’ve linked is incorrect. As Dr_Birdbrain said, fitting an SVM is a convex problem with a unique global optimum. sklearn.svm.SVC relies on libsvm, which initializes the dual coefficients to 0 [0]. The random state is only used to shuffle the data when computing probability estimates with Platt scaling [1]. Of the random_state parameter, the sklearn documentation for SVC [2] says:
Controls the pseudo random number generation for shuffling the data for probability estimates. Ignored when probability is False. Pass an int for reproducible output across multiple function calls. See Glossary.
[0] https://github.com/scikit-learn/scikit-learn/blob/2a2772a87b...
[1] https://en.wikipedia.org/wiki/Platt_scaling
[2] https://scikit-learn.org/stable/modules/generated/sklearn.sv...
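This determinism is easy to check empirically. A minimal sketch (the dataset and parameter values are illustrative, not from the thread): fit two SVC models with probability=False and different random_state values, and confirm the fitted coefficients are identical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Illustrative synthetic dataset
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Two fits differing only in random_state; with probability=False,
# random_state is documented to be ignored.
a = SVC(probability=False, random_state=1).fit(X, y)
b = SVC(probability=False, random_state=2).fit(X, y)

# The convex problem has a unique optimum, so the solutions match.
assert np.allclose(a.dual_coef_, b.dual_coef_)
assert np.allclose(a.intercept_, b.intercept_)
assert np.array_equal(a.support_, b.support_)
```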
Which article is incorrect? Indeed, it looks like there is no random initialization in libsvm, and thereby none in sklearn.svm.SVC or elsewhere in sklearn.svm.*. I seem to have confused the random initialization used in simulated annealing with SVMs; though TIL that there are annealing SVMs, that SVMs can work with wave functions (optionally mapping the wave functions into feature space with quantum state tomography), and that there are SVMs for the D-Wave quantum annealer.
From "Support vector machines on the D-Wave quantum annealer" (2020) https://www.sciencedirect.com/science/article/pii/S001046551... :
Kernel-based support vector machines (SVMs) are supervised machine learning algorithms for classification and regression problems. We introduce a method to train SVMs on a D-Wave 2000Q quantum annealer and study its performance in comparison to SVMs trained on conventional computers. The method is applied to both synthetic data and real data obtained from biology experiments. We find that the quantum annealer produces an ensemble of different solutions that often generalizes better to unseen data than the single global minimum of an SVM trained on a conventional computer, especially in cases where only limited training data is available. For cases with more training data than currently fits on the quantum annealer, we show that a combination of classifiers for subsets of the data almost always produces stronger joint classifiers than the conventional SVM for the same parameters.
My apologies for the ambiguity; I assumed it would be clear from context. The article at the link, https://saturncloud.io/blog/what-is-the-random-seed-on-svm-s..., is incorrect. Whoever wrote it seems to have confused support vector machines with neural networks.
For the D-Wave paper, I'm not sure it's fair that they compare an ensemble with a single classifier. I think it would be fairer to compare their ensemble against a bagging ensemble of linear SVMs, each using the Nystroem kernel approximation [0] and each trained with stochastic sub-gradient descent [1].
[0] https://scikit-learn.org/stable/modules/generated/sklearn.ke...
[1] https://scikit-learn.org/stable/modules/sgd.html#classificat...
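A minimal sketch of that proposed baseline, assuming scikit-learn only (all dataset sizes and hyperparameters here, like n_components=50 and n_estimators=10, are illustrative): bag pipelines of Nystroem + SGDClassifier with hinge loss, which approximates a bagged ensemble of RBF-kernel SVMs trained by sub-gradient descent.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.kernel_approximation import Nystroem
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Illustrative synthetic dataset
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Each base learner: RBF kernel approximation + linear SVM
# (hinge loss) trained by stochastic sub-gradient descent.
base = make_pipeline(
    Nystroem(kernel="rbf", n_components=50, random_state=0),
    SGDClassifier(loss="hinge", random_state=0),
)

# Bagging ensemble of the approximate kernel SVMs.
ensemble = BaggingClassifier(base, n_estimators=10, random_state=0)
ensemble.fit(X_tr, y_tr)
score = ensemble.score(X_te, y_te)
```

Whether this baseline actually closes the generalization gap reported in the paper would of course need to be tested on their data.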
Nystroem method: https://en.wikipedia.org/wiki/Nystr%C3%B6m_method
6.7 Kernel Approximation > 6.7.1. Nystroem Method for Kernel Approximation https://scikit-learn.org/stable/modules/kernel_approximation...
Nystroem defaults to an RBF (radial basis function) kernel, and - from quantum logic - Bloch spheres are also radial. Perhaps that's nothing.
FWIU SVMs w/ kernel trick are graphical models, and NNs are too.
How much more resource-expensive is it to train an ensemble of SVMs than a single graphical model with typed relations? And how does that compare to deep learning for feature synthesis and selection, plus gradient boosting with xgboost to find the coefficients/exponents of the identified terms of the expression that are not prematurely excluded by feature selection?
There are algorithmic-complexity and algorithmic-efficiency metrics that should be relevant to ranking AutoML solutions. Opcode cost may loosely correspond to algorithmic complexity.
[Dask] + SciKeras + Auto-sklearn 2.0 may or may not modify NN topology hyperparameters, like the number of layers and the nodes therein, at runtime? https://twitter.com/westurner/status/1697270946506170638