Comment by westurner
3 years ago
SVMs are randomly initialized (with arbitrary priors) and then are deterministic.
From "What Is the Random Seed on SVM Sklearn, and Why Does It Produce Different Results?" https://saturncloud.io/blog/what-is-the-random-seed-on-svm-s... :
> When you train an SVM model in sklearn, the algorithm uses a random initialization of the model parameters. This is necessary to avoid getting stuck in a local minimum during the optimization process.
> The random initialization is controlled by a parameter called the random seed. The random seed is a number that is used to initialize the random number generator. This ensures that the random initialization of the model parameters is consistent across different runs of the code
From "Random Initialization For Neural Networks : A Thing Of The Past" (2018) https://towardsdatascience.com/random-initialization-for-neu... :
> Let's look at three ways to initialize the weights between the layers before we start the forward, backward propagation to find the optimum weights.
> 1: zero initialization
> 2: random initialization
> 3: he-et-al initialization
Deep learning: https://en.wikipedia.org/wiki/Deep_learning
SVM: https://en.wikipedia.org/wiki/Support_vector_machine
Is it guaranteed that SVMs converge upon a solution regardless of random seed?
An SVM is a quadratic program, which is convex. This means it should always converge, and always to the same global optimum, regardless of initialization, as long as the problem is feasible, i.e. as long as the two classes can be separated by an SVM.
The soft-margin SVM which can handle misclassifications is also convex and has a unique global optimum [0].
[0] https://stackoverflow.com/a/12610455/992102
> as long as the two classes can be separated by an SVM.
Are the classes separable with e.g. the intertwined spiral dataset in the TensorFlow demo? Maybe only with a radial basis function kernel?
Separable state https://news.ycombinator.com/item?id=37369783
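To check that intuition concretely (a sketch; the spiral construction below is my own stand-in for the TensorFlow playground dataset, not the actual one): a linear kernel cannot separate two intertwined spirals, while an RBF kernel fits the training data almost perfectly.

```python
import numpy as np
from sklearn.svm import SVC

# Two intertwined spiral arms: the second is the first rotated 180 degrees.
t = np.linspace(0.5, 3 * np.pi, 200)
arm = np.c_[t * np.cos(t), t * np.sin(t)]
X = np.vstack([arm, -arm])
y = np.r_[np.zeros(200), np.ones(200)]

linear_acc = SVC(kernel="linear").fit(X, y).score(X, y)
rbf_acc = SVC(kernel="rbf", gamma=1.0, C=10.0).fit(X, y).score(X, y)
print(linear_acc, rbf_acc)   # RBF near 1.0, linear well below
```

Note this only shows separability in the RBF feature space; the classes are not linearly separable in the original two dimensions.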
The article you’ve linked is incorrect. As Dr_Birdbrain said, fitting an SVM is a convex problem with a unique global optimum. sklearn.svm.SVC relies on libsvm, which initializes the weights to 0 [0]. The random state is only used to shuffle the data to make probability estimates with Platt scaling [1]. Of the random_state parameter, the sklearn documentation for SVC [2] says:
> Controls the pseudo random number generation for shuffling the data for probability estimates. Ignored when probability is False. Pass an int for reproducible output across multiple function calls. See Glossary.
[0] https://github.com/scikit-learn/scikit-learn/blob/2a2772a87b...
[1] https://en.wikipedia.org/wiki/Platt_scaling
[2] https://scikit-learn.org/stable/modules/generated/sklearn.sv...
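This is easy to verify (a sketch; make_classification is just a stand-in dataset): with probability=False, the default, two different seeds produce identical fits, up to floating-point tolerance.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Different seeds, probability=False: the seed is never consulted,
# and libsvm starts from zero, so the fits match.
a = SVC(random_state=1).fit(X, y)
b = SVC(random_state=42).fit(X, y)
same = (
    np.allclose(a.dual_coef_, b.dual_coef_)
    and np.allclose(a.intercept_, b.intercept_)
    and (a.support_ == b.support_).all()
)
print(same)
```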
Which article is incorrect? Indeed, it looks like there is no random initialization in libsvm, and thereby none in sklearn.svm.SVC or anywhere in sklearn.svm.*. I seem to have confused the random initialization used in simulated annealing with SVMs; though now TIL that there are annealing SVMs, that SVMs do work with wave functions (though mapping the wave functions into feature space with quantum state tomography is optional), and that there are SVMs for the D-Wave quantum annealer.
From "Support vector machines on the D-Wave quantum annealer" (2020) https://www.sciencedirect.com/science/article/pii/S001046551... :
> Kernel-based support vector machines (SVMs) are supervised machine learning algorithms for classification and regression problems. We introduce a method to train SVMs on a D-Wave 2000Q quantum annealer and study its performance in comparison to SVMs trained on conventional computers. The method is applied to both synthetic data and real data obtained from biology experiments. We find that the quantum annealer produces an ensemble of different solutions that often generalizes better to unseen data than the single global minimum of an SVM trained on a conventional computer, especially in cases where only limited training data is available. For cases with more training data than currently fits on the quantum annealer, we show that a combination of classifiers for subsets of the data almost always produces stronger joint classifiers than the conventional SVM for the same parameters.
My apologies for the ambiguity; I assumed it would be clear from context. The article at the link, https://saturncloud.io/blog/what-is-the-random-seed-on-svm-s..., is incorrect. Whoever wrote it seems to have confused support vector machines with neural networks.
For the D-Wave paper, I'm not sure it's fair that they are comparing an ensemble with a single classifier. I think it would be more fair if they compared their ensemble with a bagging ensemble of linear SVMs which each use the Nystroem kernel approximation [0] and which are each trained using stochastic sub-gradient descent [1].
[0] https://scikit-learn.org/stable/modules/generated/sklearn.ke...
[1] https://scikit-learn.org/stable/modules/sgd.html#classificat...
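A sketch of the baseline I have in mind (dataset, ensemble size, and parameters are illustrative, not from the paper): bootstrap a few Nystroem [0] + SGDClassifier [1] pipelines and majority-vote their predictions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.kernel_approximation import Nystroem
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline

X, y = make_classification(n_samples=600, n_features=10, random_state=0)
Xtr, ytr, Xte, yte = X[:400], y[:400], X[400:], y[400:]

rng = np.random.default_rng(0)
members = []
for k in range(5):
    idx = rng.choice(len(Xtr), size=len(Xtr), replace=True)   # bootstrap sample
    clf = make_pipeline(
        Nystroem(n_components=100, random_state=k),    # approximate RBF features
        SGDClassifier(loss="hinge", random_state=k),   # linear SVM via stochastic sub-gradient descent
    ).fit(Xtr[idx], ytr[idx])
    members.append(clf)

# Majority vote over the ensemble (5 members, so no ties).
votes = np.mean([m.predict(Xte) for m in members], axis=0)
acc = np.mean((votes > 0.5).astype(int) == yte)
print(acc)
```

That would pit an ensemble against an ensemble, at comparable training cost, which seems like the fairer comparison.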