I am one of the authors. The most critical aspect is that the transformer is a "different kind of SVM". It solves an SVM that separates 'good' tokens within each input sequence from 'bad' tokens. This SVM serves as a good-token selector and is inherently different from the traditional SVM, which assigns a 0-1 label to inputs.
This also explains how attention induces sparsity through softmax: 'bad' tokens that fall on the wrong side of the SVM decision boundary are suppressed by the softmax function, while 'good' tokens are those that end up with non-zero softmax probabilities. It is also worth mentioning that this SVM arises from the exponential nature of the softmax.
The title of the paper does not make this clear, but hopefully the abstract does :).
Well, guess what, transformer is also a "traditional" SVM that assigns a 0-1 label: https://openreview.net/forum?id=U_T8-5hClV
It is interesting that you have cited this paper but did not even correctly acknowledge their contribution. Yeah I get all that "they are doing X and we are doing X+1" narrative, but the fact that you have defined "good" tokens by multiplying Y_i to your head function, is not much different than "assigning 0-1" label to inputs in traditional SVM. Your "Y_i" essentially serves as a 0-1 label in SVM.
Sounds like a mind game of re-branding existing concepts lol.
Any comment on how the paper relates to "Every Model Learned by Gradient Descent Is Approximately a Kernel Machine" by Pedro Domingos?
https://arxiv.org/pdf/2308.16898.pdf
This seems related to the NTK literature, i.e. wide neural nets behave like kernel regression. NTK is a great tool, but a notable weakness is that the kernel view doesn't explain how the model learns new features. The transformer is also pretty different from standard neural architectures because tokens interact with each other through attention. Our goal was capturing this interaction, and we believe there is a clean insight on feature learning: attention is running a token-selection procedure by implementing an SVM that separates tokens.
When you say SVM, do you mean any classifier that finds a separating hyperplane, like a no-hidden-layer "perceptron" or Naive Bayes, instead of one which finds the maximum margin hyperplane? Or is finding the maximum margin important here? Thanks. Very interesting.
I think our own brains and nervous system use a step-function as their "activation function", so this could - optimistically - be a throwback to the roots of Rosenblatt's idea.
This SVM summarizes the training dynamics of the attention layer, so there is no hidden layer. It operates on the token embeddings of that layer. Essentially, the weights of the attention layer converge (in direction) to the maximum-margin separator between the good and bad tokens. Note that there is no label involved; instead, you are separating the tokens based on their contribution to the training loss. We can formally assign a "score" to each token for a 1-layer model, but this is tricky to do for multilayer models with MLP heads.
Finally, I agree that this is more step-function like. There are caveats we discuss in the paper (i.e. how TF assigns continuous softmax probabilities over the selected tokens).
To me, summary is: Through softmax-attention, transformer is running a "feature/token selection procedure". Thanks to softmax, we can obtain a clean SVM interpretation of max-margin token separation.
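A minimal numpy sketch of this selection effect (the embeddings and the separating direction `w` below are hand-picked toy values, not the paper's construction): as the attention weights grow along a separating direction, softmax suppresses the tokens on the wrong side.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())   # shift for numerical stability
    return e / e.sum()

# Toy token embeddings for one input sequence: the first two rows sit on
# the "good" side of a hand-picked separating direction w, the last two
# on the "bad" side. (Illustrative values only.)
X = np.array([[ 2.0,  1.0],
              [ 1.0,  3.0],
              [-1.0, -0.5],
              [-2.0, -1.0]])
w = np.array([1.0, 0.5])      # stands in for the attention weights' direction

# As the attention weights grow along w, softmax over the scores X @ w
# drives the probabilities of the wrong-side tokens to zero.
for scale in [1.0, 5.0, 25.0]:
    p = softmax(X @ (scale * w))
    print(scale, np.round(p, 4))
```

At a large scale, the last two tokens get essentially zero probability while the first two share the mass, which is the "selector" behavior described above.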
> It solves an SVM that separates 'good' tokens within each input sequence from 'bad' tokens. This SVM serves as a good-token-selector and is inherently different from the traditional SVM which assigns a 0-1 label to inputs.
sorry but how is separating 'good' tokens from 'bad' tokens inherently different from assigning a 0-1 label
Here is what I meant:
Standard SVM classifier: maps an input sequence to a 0-1 label. Example: take a paragraph and return its sentiment. During training, the label is specified.
Transformer's SVM: takes an input sequence, suppresses bad tokens, and passes good tokens to the next layer. This is a token selector rather than a classifier.
Example: take a paragraph and output the salient words in it. We don't know in advance which words are salient; the model has to figure them out during training.
AFAIR, SVMs have one optimal solution, achievable analytically. NN can get stuck in local optima.
If a transformer is a SVM, could we simply extract it out and optimise the hyperplane like for any SVM?
I have read that SVMs as a machine learning model failed to take off because of their inability to scale relative to deep neural networks. Would your work point to ways of changing this?
how is your paper different from all the ones like 'transformers are really x' where x is the author's special field of study
IMO it is important to understand transformer mechanics through core ML themes like SVM and feature-selection. Our results are not interpretation, they are mathematically rigorous and numerically verifiable. That said, we have no intention of trivializing a complex model like GPT-4 as a simple SVM. That is a tall order :)
If there is actually equivalence between different type systems and algorithms, that opens the door for simplification through unification.
Practically speaking, does this give us anything interesting from an implementation perspective? My uneducated reading of this is that a single SVM layer is equivalent to the multiple steps in a transformer layer. I'm guessing it can't reduce the number of computations purely from an information theory argument, but doesn't it imply a radically simpler and easier to implement architecture?
I just read the abstract so could be way off, but sounds more like one of those papers that connect seemingly different mathematical formalisms and show their equivalence (often under some restrictions). Typically they don’t give us much immediate benefit in terms of implementation, but they add to the intuitive understanding of what we’re doing, and sometimes helps others make more practical progress.
I'm not an expert in this, so hopefully someone more knowledgeable can weigh in, but SVMs are understood much better from the perspective of overfitting and things like the VC bound, while Transformers are not really understood as well. From what I remember it's quite easy to have an SVM overfit, while Transformers have fewer issues. It'd be interesting to understand why
So if the two are somehow connected, then that could have implications for tuning and fighting overfitting
maybe it'd also be possible to design better non-overfitting SVMs
> From what I remember it's quite easy to have a SVM overfit ... It'd be interesting to understand why
SVMs with well-tuned kernels and regularization are reasonably resistant to overfitting. The problem is that you can easily end up overfitting the hyperparameters if you're not very careful about how you do performance testing.
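One standard guard against that kind of hyperparameter overfitting is nested cross-validation, where the hyperparameter search runs inside each outer fold. A sketch with sklearn on synthetic data (the grid values are arbitrary):

```python
# Sketch: estimating SVM performance without overfitting the
# hyperparameters, via nested cross-validation.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=20, random_state=0)

# Inner loop: pick C and gamma. Outer loop: score the whole tuning
# procedure on folds it never tuned on.
search = GridSearchCV(SVC(), {"C": [0.1, 1, 10], "gamma": ["scale", 0.01]}, cv=3)
outer_scores = cross_val_score(search, X, y, cv=5)
print(outer_scores.mean())
```

The outer score is an estimate of how the *tuned* model generalizes, rather than a score that was itself optimized over.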
Those equivalences can connect two different fields and allow transferring methods from one to the other. Each field usually has developed quite a number of methods and tricks over time. So when this work shows that they are equivalent (with restrictions), you can maybe take some of the tricks of SVMs and try to use them to improve the Transformer model or its training.
Otherwise, they just help us in better understanding Transformers and SVMs.
There have been similar equivalences before, for example:
Linear Transformers Are Secretly Fast Weight Programmers, https://arxiv.org/abs/2102.11174
Or policy gradient methods from reinforcement learning are basically the same as sequence-discriminative training as it has been done in speech recognition for many years; however, they come with different tricks, and combining the tricks was helpful.
I am waiting for someone publishing the theoretical limits of these "AI" systems. They're certainly impressive language models - don't get me wrong on that. But every algorithm and every model has its limits. To know the limits turns their application from hype into engineering. And of course, the hype-sellers will try to keep that from happening as long as possible.
Hey,
https://en.wikipedia.org/wiki/Universal_approximation_theore...
This theorem explains the limits. Putting it in simple terms: most architectures are universal approximators that are constrained by the inductive bias we give them. So far, the architecture least constrained by its inductive bias is the transformer, so it should be able to approximate any mathematical function. The current problem is that the attention mechanism has quadratic scaling, so while it is easy to scale on text, it is pretty hard to scale on anything else to the same performance. Even if no further discoveries are made, the compute power of the future should let it scale in every field; even with today's techniques it gives pretty good performance on a lot of tasks.
This review of the paper "An Image Is Worth 16x16 Words" by Yannic Kilcher explains it better, if you are interested.
https://youtu.be/TrdevFK_am4?t=1314

Hype sellers, despite being annoying and noisy, are not the reason why it's hard to figure out the theoretical limits.
To put it the form of a rhetorical question: many of these models are public, so why "wait" when you could do the research yourself?
> I am waiting for someone publishing the theoretical limits of these "AI" systems.
> To know the limits turns their application from hype into engineering.
It would be helpful to know how the models actually work under the hood.
But we made very good use of metals for thousands of years before we understood things like atoms, chemical bonds, lattices, etc.
Some engineering disciplines can be made up largely of empirical knowledge.
Engineering to me is "make the things we want out of the things we have", and not necessarily "design based on complete scientific theories".
What are the fundamental limits of language itself? Is English somehow more "emergent" than Korean? Isn't this more interesting than the actual execution mechanism?
The business of these new LLMs is next token prediction with context. This is also now a mission because it clearly works to some large extent. Where most would not have been willing to take a leap of faith prior, many can see some path now. I've been able to suspend my disbelief around language-as-computation long enough to discover new options.
You're looking for the universal approximation theorem. It's one of those cases where they can do anything in theory, so the question is more whether we're chasing a Turing tarpit, where everything is possible but nothing is easy
Fully connected neural networks are hierarchies of logistic regression nodes. Transformers are networks of SVM nodes. I guess we can expect networks of other kinds of classifiers in the future. Perhaps networks of Decision Tree nodes? Mix and match?
NNs are decision trees anyway -- take any classification alg and rewind from its decision points into a disjunction of conditions.
Or, maybe more clearly: imagine taking any classification algorithm and drawing the graph of all of its predictions across its domain. Then just construct a decision tree which "draws splits" along the original algorithm's decision edges.
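A rough illustration of that construction (the dataset, the small network, and the tree depth are all arbitrary choices): label a dense grid with one classifier's predictions, fit a tree to those labels, and measure agreement.

```python
# Sketch: a decision tree "drawing splits along the decision edges" of
# another classifier, by training on that classifier's predictions over
# a dense grid (toy 2-D data).
import numpy as np
from sklearn.datasets import make_moons
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=500, noise=0.1, random_state=0)
net = MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000,
                    random_state=0).fit(X, y)

# Label a dense grid over the domain with the network's predictions,
# then fit a tree to reproduce those labels.
xx, yy = np.meshgrid(np.linspace(-2, 3, 200), np.linspace(-1.5, 2, 200))
grid = np.c_[xx.ravel(), yy.ravel()]
tree = DecisionTreeClassifier(max_depth=12, random_state=0).fit(grid, net.predict(grid))

agreement = (tree.predict(grid) == net.predict(grid)).mean()
print(agreement)
```

The tree ends up agreeing with the network almost everywhere on the grid, which is the sense in which the tree mimics the original decision boundary (at the cost of many axis-aligned splits).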
Likewise, all ML is equivalent to a KNN parameterised on an averaging operation.
Everything here is equivalent to everything else. ML is just computing an expectation over a training dataset, weighted by the model parameters.
The "value" comes from the (copyright laundering/) data. The only question is: can you find useful weights by which to control the expectation you're taking?
Various ML approaches weight the training data differently. The most successful of the latest round of AI manages to compute weights across everything ever written, hence more useful than a naive KNN, which wouldn't terminate on 1PB of text.
> NNs are decision trees anyway -- take any classification alg and rewind from its decision points into a disjunction of conditions
By that argument, every computation can be reduced to a lookup table. Take every possible input, memorize the correct output and store it in a database of sorts.
If decision trees were truly equivalent to NNs, you would be able to solve any problem currently addressed with NNs but using only decision trees without learning from the output of the NN. Same input datasets same output quality metrics.
Not really feasible, is it?
Likewise with all the other equivalences you made here.
> When you train an SVM model in sklearn, the algorithm uses a random initialization of the model parameters. This is necessary to avoid getting stuck in a local minimum during the optimization process.
> The random initialization is controlled by a parameter called the random seed. The random seed is a number that is used to initialize the random number generator. This ensures that the random initialization of the model parameters is consistent across different runs of the code
An SVM is a quadratic program, which is convex. This means that it should always converge, and always to the same global optimum regardless of initialization, as long as the problem is feasible, i.e. as long as the two classes can be separated by an SVM.
The article you’ve linked is incorrect. As Dr_Birdbrain said, fitting an SVM is a convex problem with a unique global optimum. sklearn.SVC relies on libsvm, which initializes the weights to 0 [0]. The random state is only used to shuffle the data to make probability estimates with Platt scaling [1]. Of the random_state parameter, the sklearn documentation for SVC [2] says:
Controls the pseudo random number generation for shuffling the data for probability estimates. Ignored when probability is False. Pass an int for reproducible output across multiple function calls. See Glossary.
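A quick check of that behaviour on synthetic data: with `probability=False` (the default), `random_state` does not affect the fit at all, and two fits land on the same hyperplane.

```python
# Sketch: the SVC fit is deterministic given the data; random_state only
# matters for probability calibration (which is off by default).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

a = SVC(kernel="linear", random_state=0).fit(X, y)
b = SVC(kernel="linear", random_state=42).fit(X, y)

print(np.allclose(a.coef_, b.coef_))   # same separating hyperplane either way
```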
Which article is incorrect? Indeed it looks like there is no random initialization in libsvm or thereby sklearn.svm.SVC or in sklearn.svm.*. I seem to have confused random initialization in Simulated Annealing with SVMs; though now TIL that there are annealing SVMs, and SVMs do work with wave functions (though it's optional to map the wave functions into feature space with quantum state tomography), and that there are SVMs for the D-Wave Quantum annealer QC.
Kernel-based support vector machines (SVMs) are supervised machine learning algorithms for classification and regression problems. We introduce a method to train SVMs on a D-Wave 2000Q quantum annealer and study its performance in comparison to SVMs trained on conventional computers. The method is applied to both synthetic data and real data obtained from biology experiments. We find that the quantum annealer produces an ensemble of different solutions that often generalizes better to unseen data than the single global minimum of an SVM trained on a conventional computer, especially in cases where only limited training data is available. For cases with more training data than currently fits on the quantum annealer, we show that a combination of classifiers for subsets of the data almost always produces stronger joint classifiers than the conventional SVM for the same parameters.
The weight per datapoint thing is actually kind of orthogonal to the concept of an SVM, but is conflated by most introductions to SVMs. SVMs are linear models using hinge loss. In the "primal" optimization perspective (rather than the dual problem SVMs are usually formulated as), one optimizes the feature weights like normal. This is not sparse in general, but it's not like dual SVM weights are particularly sparse in practice.
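A sketch of that primal view, assuming nothing beyond numpy: minimize hinge loss plus an L2 penalty on the feature weights with plain subgradient descent (toy noiseless data, arbitrary step size and penalty).

```python
# Primal-view sketch: a linear SVM is hinge loss + an L2 penalty on the
# feature weights, optimized directly by subgradient descent.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)   # noiseless, linearly separable labels

w, lam, lr = np.zeros(2), 0.01, 0.1
for _ in range(500):
    margins = y * (X @ w)
    active = margins < 1                     # points where the hinge is active
    grad = lam * w
    if active.any():
        grad = grad - (y[active][:, None] * X[active]).mean(axis=0)
    w -= lr * grad

accuracy = (np.sign(X @ w) == y).mean()
print(accuracy)
```

No dual variables or per-datapoint weights appear anywhere; the model is just a weight vector over features, like any other linear model.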
If I can expand on your "kind of", it would be that because of the kernel trick, it actually does matter that the data itself can determine the "linear" (in an infinite dimensional space, that would require infinitely many parameters under the primal formulation) model.
Yes, SVMs don't store weights like parametric models, but they also don't store weights "per data point". Only the points closest to the decision boundary are stored (i.e., the "support vectors").
The attention matrix is computed based on all tokens in the context, so it kind of functions non-parametrically (but over the batch instead of over the whole training dataset)
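In code, a single head looks like a data-dependent weighted average over whatever tokens happen to be in the context (random toy matrices stand in for the learned projections here):

```python
# Sketch: one attention head as a data-dependent weighted average over
# the tokens in the current context.
import numpy as np

rng = np.random.default_rng(1)
n, d = 6, 4                              # context length, embedding dim
X = rng.normal(size=(n, d))              # token embeddings in the context
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

Q, K, V = X @ Wq, X @ Wk, X @ Wv
scores = Q @ K.T / np.sqrt(d)            # recomputed from scratch per context
A = np.exp(scores - scores.max(axis=1, keepdims=True))
A /= A.sum(axis=1, keepdims=True)        # each row is a distribution over tokens

out = A @ V                              # outputs: weighted averages of values
print(out.shape)
```

The projections are parametric, but the mixing weights `A` are recomputed from the context itself at every forward pass, which is the kernel-smoother-like, non-parametric flavor mentioned above.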
Fuck, imagine how many doctoral theses I could've written every time I tweaked a few lines of code to try some abstract way of recombining outputs I didn't fully understand. I missed the boat. All this jargon is absolutely for show, though. Purely intended to create the impression that there's some kind of moat to the "discovery". There are much clearer ways to express "we fucked around with putting the outputs of this black box back into the inputs", but I guess that doesn't impress the rubes.
I think this applies to everything right now. Papers like this are just ridiculous examples. In like, 6th grade I won second place at the LA county science fair for coding a simulation of a coyote's life in hypercard (with tons of graphs). Yay. Y'know what? That shit and those graphs would've been incomprehensible to the judges if it hadn't been written in plain language, in an attempt to make them understand what they were looking at. My entire career since has been an attempt to communicate and alleviate the pain points in communication between parties, by way of writing software that encapsulated their descriptions of what they needed. And likewise I never pretended to be smarter or know more than my clients did: Everything must be explained and comprehensible in normal people language. People need to know how shit works, especially if they're paying for it.
Or they should.
Or if they don't know and don't care, they're fucking negligent.
Especially if they say "wow that sounds smart, let's let these guys run our weapons program".
To your point, the reason this ornate language thrives and people get away with complacency about how their own systems work, boils down to a silent pact between managers and engineers to sweep everything under the rug out of laziness and ill-will. There's something blatantly mendacious and evil (in the banal way) about the agreement that managers approve black boxes which were approved by complex-sounding papers so that upper management can wash their hands of the results.
[edit] maybe I'm just bitter because I spent hours today pondering exactly how many engineers at Monsanto must have known about the dangers of the astroturf, and how many raised their hand, or hid behind a spreadsheet
What I find entertaining/confounding is how difficult the abstracts to these new AI papers are to understand. It feels like academia is pushing this style, so it’s hard to blame the authors since they have to play the game.
For reference I have an undergrad degree in computer science, have been working professionally for 25 years, and am fairly data centric in my work.
I’m hoping when I run this through GPT4 to get an explanation for a mortal software developer something sensible comes out the other end.
"Not math-y enough"/ "Needs more math" is a very common feedback ML/AI researchers get when writing papers.
The other day I was watching a live-stream of a doctoral defense, as the thesis was quite relevant to my work.
So one of the committee members would really pick and criticize the math - ask questions like "You are supposed to be the bleeding edge on this topic, why was the math so simple? Did you research more rigorous theories to explain the math?" etc. (He was awarded the doctorate though)
So, I dunno, if that's how things are now - it makes sense to me that the authors go overboard with complicated notation, even if they could have written it much simpler. Probably makes the work seem more rigorous and legit.
Doesn't really take that much more time, and it covers your ass from "not rigorous enough" gotchas - though at the expense of readability.
If the excuse is true and the "ornate" language really is a dense representation of information then it should be fairly trivial to have an LLM agent unsummarize it.
There could be a webservice that offers a parallel track of layman's translations of any paper.
I wish also. When I was young and new, so much time wasted trying to parse 'arcane' math that was really something simple but dressed up as complicated to give it weight.
Watching the AI community rediscover automatic differentiation 20+ years after the field was considered "mature" was equal parts frustrating & fascinating. The frustration was them rewriting the history of discovery, but without any sort of sense or rigor ... and it was also the most fascinating!
I'm waiting for some fresh group of grad students to make a breakthrough using a reinvented version of Pearl's "do" calculus, or maybe they'll make some narrow breakthrough using Bayes nets and everyone will geek out on those for a while
*I do think transformers (much like ff networks + backprop from 2012-2018) are probably a lasting software architecture for inference applications until we come up with new hardware, and move beyond GPU focused computing
It's exciting to see it all working, but disheartening how a-historical this last few years has been in AI - with the exception of Brooks, Sutton and a few other greybeards in the field who say similarly
I think the main motivation in ML theory that touches current SOTA is not "expressing simple ideas with jargon for show". Jargon is necessary, even if some (mostly very practical) engineers or software people cannot see it because of how unnecessary it seems to them (as they are used to expressing themselves practically and quickly). It's a jargon for the mathematics of machine learning, which is pretty unstandardized, so to speak. So you need to define it yourself. And without jargon and clear proofs, what you do is just brainstorming at most. The value of such work is that its statements are clear, proved, and contain hypotheses which can be tested by future papers.
Here is an example: to explain the existence of adversarial examples, there are two suggestions stated without jargon: 1) that the decision boundary is too nonlinear, 2) that the decision boundary is too linear. These two explanations contradict each other and are stated without any real proof, yet both can be widely heard in adversarial-example papers. If we had clear formulations of these two statements, we could have tested both claims, but unfortunately the papers that suggested these theories didn't put effort into defining the jargon and stating their suggestion as a clear, formal statement.
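The "too linear" claim at least has a clear formal core: for a linear score w.x, the worst-case score shift under an L-infinity perturbation of size eps is exactly eps * ||w||_1, which grows with input dimension. A toy sketch (random weights, FGSM-style sign perturbation):

```python
# Sketch of the "too linear" story made precise: for a linear score w.x,
# an L-infinity perturbation of size eps shifts the score by at most
# eps * ||w||_1, attained by the sign perturbation below. The attainable
# shift grows with input dimension d.
import numpy as np

rng = np.random.default_rng(0)
eps = 0.1
for d in [10, 100, 1000]:
    w = rng.normal(size=d)
    x = rng.normal(size=d)
    x_adv = x + eps * np.sign(w)         # worst-case L-inf perturbation
    shift = w @ x_adv - w @ x            # equals eps * ||w||_1
    print(d, round(float(shift), 2))
```

A formal statement like this is exactly the kind of testable hypothesis the comment above is asking for.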
I studied using ML just over a decade ago. I actually compared MLPs to SVMs and had a similar thought to this. It does seem like there is a regression on understanding some of the fundamentals and older tools of the trade.
I guess everyone gets focused on the newer things.
Really does seem like people rediscovering older endpoints.
There's been a huge flood of vanilla software engineers into ML, retconning it as "a subfield of computer science" (computability is a minor concern compared to the statistical underpinnings). They pretend to know the math because they can read the equations, then claim with utmost confidence that actually they're doing all the hard work in ML because they are experts in calling APIs and integrating into products, however useful or useless.
> we show that over-parameterization catalyzes global convergence by ensuring the feasibility of the SVM problem and by guaranteeing a benign optimization landscape devoid of stationary points
does this mean 'an over-parameterized transformer problem is a convex svm problem'?
I wouldn’t go this far, applied ML articles are my favorite articles. If you’re in the arena, it’s good to see things that other people have done from a practical perspective so you can ape it in your own work or not give it further consideration.
Turing machines can also be used as universal function approximators. But I'm not sure it makes sense to put them in the same category as the other two.
I would love to put them in the same category as the other two. In fact I’ve spent quite a lot of time thinking about it / experimenting. Wouldn’t it be great if we could somehow train on data and get a small Turing machine instead of a huge neural network?
The term hyperplane already assumes that the hypothesis space that your learning algorithm searches has some kind of dimension and is some variant of an Euclidean / vector space (and its generalisations). This is not the case for many forms of ML, for example grammar induction (where the hypothesis space is Chomsky-style grammars) or inductive logic programming (hypothesis space are Prolog (or similar) programs), or, more generally, program synthesis (where programs form the hypothesis space).
Someday I'm going to write a paper that achieves SOTA results with a nigh-incomprehensible mishmash of diverse techniques and title it "All You Need Considered Harmful".
A hyperplane is a multi-dimensional linear function that splits space into two distinct regions. In the context of a classifier, it splits feature space into disjoint sub-spaces (one for each class). SVMs effectively place the hyperplane with maximum margin, thereby separating classes in an optimal way.
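A toy sketch of that (four hand-picked points; a large C approximates a hard margin): the sign of the decision function picks the sub-space, and the support vectors sit on the margin where |w.x + b| = 1.

```python
# Toy sketch: four points, two classes; a large C approximates a hard
# margin, so the support vectors land on |w.x + b| = 1.
import numpy as np
from sklearn.svm import SVC

X = np.array([[0.0, 0.0], [0.0, 1.0], [2.0, 0.0], [2.0, 1.0]])
y = np.array([0, 0, 1, 1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)

# decision_function is w.x + b; its sign picks the sub-space / class.
print(clf.predict(X))
print(clf.decision_function(clf.support_vectors_))
```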
I regret to inform you, I don’t think it’s the same set of people. Writing a cute “Transformers are SVMs” paper and “building chatGPT” are not the same skillset.
This theorem explain the limits, putting it in simple terms, most architectures are universal approximators that are constrained by the inductive bias that we give them, so far the approximator arquitectured that is less constrained by the inductive bias is the transformer, so it should be able to approximate any mathematical function, the current problem is that the attention mechanism have a quadratic scaling, so while is easy to scale it in text, is pretty hard to scale it with anything else to the same performance, even if not further discoveries are made, just with the computer power of the future it should be able to scale in every field, even with the techniques of today it gives pretty good performance in a lot of tasks.
This review by Yannic Kilcher of the paper "An Image is Worth 16x16 Words" explains it better if you are interested.
https://youtu.be/TrdevFK_am4?t=1314
Hype sellers, despite being annoying and noisy, are not the reason why it's hard to figure out the theoretical limits.
To put it the form of a rhetorical question: many of these models are public, so why "wait" when you could do the research yourself?
> I am waiting for someone publishing the theoretical limits of these "AI" systems.
> To know the limits turns their application from hype into engineering.
It would be helpful to know how the models actually work under the hood.
But we made very good use of metals for thousands of years before we understood things like atoms, chemical bonds, lattices, etc.
Some engineering disciplines can be made up largely of empirical knowledge.
Engineering to me is "make the things we want out of the things we have", and not necessarily "design based on complete scientific theories".
What are the fundamental limits of language itself? Is English somehow more "emergent" than Korean? Isn't this more interesting than the actual execution mechanism?
The business of these new LLMs is next token prediction with context. This is also now a mission because it clearly works to some large extent. Where most would not have been willing to take a leap of faith prior, many can see some path now. I've been able to suspend my disbelief around language-as-computation long enough to discover new options.
You're looking for the universal approximation theorem. It's one of those cases where they can do anything in theory, so the question is more whether we're chasing a Turing tarpit, where everything is possible but nothing is easy.
Fully connected neural networks are hierarchies of logistic regression nodes. Transformers are networks of SVM nodes. I guess we can expect networks of other kinds of classifiers in the future. Perhaps networks of Decision Tree nodes? Mix and match?
NNs are decision trees anyway -- take any classification alg and rewind from its decision points into a disjunction of conditions.
Or, maybe more clearly: imagine taking any classification algorithm and drawing the graph of all of its predictions across its domain. Then just construct a decision tree which "draws splits" along the original algorithm's decision edges.
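To make that concrete, here's a toy sketch (the teacher rule and the split values are made up for illustration): a hand-rolled depth-2 tree reproduces an XOR-style classifier exactly by splitting along its decision edges.

```python
import numpy as np

def teacher(x):
    """Toy "any classification algorithm": label is the sign of x0 * x1
    (XOR-style quadrant rule), so its decision edges are the two axes."""
    return 1.0 if x[0] * x[1] >= 0 else -1.0

def mimic_tree(x):
    """Depth-2 decision tree whose splits are drawn exactly along the
    teacher's decision edges (the x0 = 0 and x1 = 0 axes)."""
    if x[0] >= 0:
        return 1.0 if x[1] >= 0 else -1.0
    else:
        return -1.0 if x[1] >= 0 else 1.0

# The tree reproduces the teacher everywhere (checked here on random points).
rng = np.random.default_rng(0)
points = rng.uniform(-1, 1, size=(1000, 2))
print(all(teacher(p) == mimic_tree(p) for p in points))  # True
```

Of course, as the replies below the original comment note, an arbitrary decision boundary can require an enormous number of splits, which is where this equivalence stops being practically useful.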
Likewise, all ML is equivalent to a KNN parameterised on an averaging operation.
Everything here is equivalent to everything else. ML is just computing an expectation over a training dataset, weighted by the model parameters.
The "value" comes from the (copyright laundering/) data. The only question is: can you find useful weights by which to control the expectation you're taking?
Various ML approaches weight the training data differently. The most successful of the latest round of AI manages to compute weights across everything ever written, hence more useful than a naive KNN, which wouldn't terminate on 1 PB of text.
> NNs are decision trees anyway -- take any classification alg and rewind from its decision points into a disjunction of conditions
By that argument, every computation can be reduced to a lookup table. Take every possible input, memorize the correct output and store it in a database of sorts.
If decision trees were truly equivalent to NNs, you would be able to solve any problem currently addressed with NNs using only decision trees, without learning from the output of the NN. Same input datasets, same output quality metrics.
Not really feasible, is it?
Likewise with all the other equivalences you made here.
> Fully connected neural networks are hierarchies of logistic regression nodes.
Only if you use softmax as your activation function.
You mean sigmoid activation function?
SVMs are randomly initialized (with arbitrary priors) and then are deterministic.
From "What Is the Random Seed on SVM Sklearn, and Why Does It Produce Different Results?" https://saturncloud.io/blog/what-is-the-random-seed-on-svm-s... :
> When you train an SVM model in sklearn, the algorithm uses a random initialization of the model parameters. This is necessary to avoid getting stuck in a local minimum during the optimization process.
> The random initialization is controlled by a parameter called the random seed. The random seed is a number that is used to initialize the random number generator. This ensures that the random initialization of the model parameters is consistent across different runs of the code
From "Random Initialization For Neural Networks : A Thing Of The Past" (2018) https://towardsdatascience.com/random-initialization-for-neu... :
> Lets look at three ways to initialize the weights between the layers before we start the forward, backward propagation to find the optimum weights.
> 1: zero initialization
> 2: random initialization
> 3: he-et-al initialization
Deep learning: https://en.wikipedia.org/wiki/Deep_learning
SVM: https://en.wikipedia.org/wiki/Support_vector_machine
Is it guaranteed that SVMs converge upon a solution regardless of random seed?
An SVM is a quadratic program, which is convex. This means they should always converge, and always to the same global optimum regardless of initialization, as long as they are feasible, i.e. as long as the two classes can be separated by an SVM.
The soft-margin SVM which can handle misclassifications is also convex and has a unique global optimum [0].
[0] https://stackoverflow.com/a/12610455/992102
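As a sanity check of the convexity claim, here's a toy numpy sketch. Note the hedges: it uses the squared hinge (a smoothed soft-margin variant, which is also convex with a unique optimum thanks to the regularizer) rather than the standard hinge, and the data and hyperparameters are made up.

```python
import numpy as np

def train_svm(X, y, seed, lam=0.1, lr=0.05, steps=5000):
    """Gradient descent on a smoothed soft-margin SVM primal:
       min_w  lam/2 * ||w||^2 + mean(max(0, 1 - y * (X @ w))^2).
    The squared hinge keeps the objective convex and differentiable, and the
    regularizer makes it strongly convex, so the minimizer is unique
    regardless of the random starting point."""
    rng = np.random.default_rng(seed)
    w = rng.normal(size=X.shape[1])          # random initialization
    for _ in range(steps):
        slack = np.maximum(0.0, 1.0 - y * (X @ w))
        grad = lam * w - 2.0 * ((slack * y)[:, None] * X).mean(axis=0)
        w -= lr * grad
    return w

rng = np.random.default_rng(42)
X = np.vstack([rng.normal(+2, 0.5, size=(20, 2)),
               rng.normal(-2, 0.5, size=(20, 2))])
y = np.array([1.0] * 20 + [-1.0] * 20)

w_a = train_svm(X, y, seed=0)
w_b = train_svm(X, y, seed=123)
print(np.allclose(w_a, w_b, atol=1e-4))  # True: same optimum from different inits
```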
> as long as the two classes can be separated by an SVM.
Are the classes separable with e.g. the intertwined spiral dataset in the TensorFlow demo? Maybe only with a radial basis function kernel?
Separable state https://news.ycombinator.com/item?id=37369783
The article you’ve linked is incorrect. As Dr_Birdbrain said, fitting an SVM is a convex problem with unique global optimum. sklearn.SVC relies on libsvm which initializes the weights to 0 [0]. The random state is only used to shuffle the data to make probability estimates with Platt scaling [1]. Of the random_state parameter, the sklearn documentation for SVC [2] says
Controls the pseudo random number generation for shuffling the data for probability estimates. Ignored when probability is False. Pass an int for reproducible output across multiple function calls. See Glossary.
[0] https://github.com/scikit-learn/scikit-learn/blob/2a2772a87b...
[1] https://en.wikipedia.org/wiki/Platt_scaling
[2] https://scikit-learn.org/stable/modules/generated/sklearn.sv...
Which article is incorrect? Indeed, it looks like there is no random initialization in libsvm, and thereby none in sklearn.svm.SVC or sklearn.svm.*. I seem to have confused the random initialization in simulated annealing with SVMs; though now TIL that there are annealing SVMs, that SVMs do work with wave functions (though it's optional to map the wave functions into feature space with quantum state tomography), and that there are SVMs for the D-Wave quantum annealer QC.
From "Support vector machines on the D-Wave quantum annealer" (2020) https://www.sciencedirect.com/science/article/pii/S001046551... :
Kernel-based support vector machines (SVMs) are supervised machine learning algorithms for classification and regression problems. We introduce a method to train SVMs on a D-Wave 2000Q quantum annealer and study its performance in comparison to SVMs trained on conventional computers. The method is applied to both synthetic data and real data obtained from biology experiments. We find that the quantum annealer produces an ensemble of different solutions that often generalizes better to unseen data than the single global minimum of an SVM trained on a conventional computer, especially in cases where only limited training data is available. For cases with more training data than currently fits on the quantum annealer, we show that a combination of classifiers for subsets of the data almost always produces stronger joint classifiers than the conventional SVM for the same parameters.
SVMs typically have weights per data point, i.e. they're nonparametric/hyperparametric. Modern machine learning doesn't really work like that anymore, right?
The weight per datapoint thing is actually kind of orthogonal to the concept of an SVM, but is conflated by most introductions to SVMs. SVMs are linear models using hinge loss. In the "primal" optimization perspective (rather than the dual problem SVMs are usually formulated as), one optimizes the feature weights like normal. This is not sparse in general, but it's not like dual SVM weights are particularly sparse in practice.
Totally. Thank you for expanding on "typically".
If I can expand on your "kind of": because of the kernel trick, it actually does matter that the data itself can determine the "linear" model (which, in an infinite-dimensional space, would require infinitely many parameters under the primal formulation).
Yes SVM’s don’t store weights like parametric models but they also don’t store weights “per data point”. Only the points closest to the decision boundary are stored (i.e., the “support vectors”).
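A toy sketch of that sparsity (again a squared-hinge variant of the soft-margin primal, with made-up data): at a stationary point, the weight vector can be rebuilt from only the points with margin below 1, i.e. the support vectors.

```python
import numpy as np

# Fit a linear max-margin-style classifier by gradient descent on a
# squared-hinge soft-margin primal; illustrative toy setup.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(+2, 0.5, size=(25, 2)),
               rng.normal(-2, 0.5, size=(25, 2))])
y = np.array([1.0] * 25 + [-1.0] * 25)

lam, lr, w = 0.1, 0.05, np.zeros(2)
for _ in range(5000):
    slack = np.maximum(0.0, 1.0 - y * (X @ w))   # zero except at margin violators
    w -= lr * (lam * w - 2.0 * ((slack * y)[:, None] * X).mean(axis=0))

# Stationarity (lam * w = 2 * mean(slack * y * x)) means w is a weighted
# combination of ONLY the points with margin < 1 -- the support vectors.
slack = np.maximum(0.0, 1.0 - y * (X @ w))
support = slack > 0
w_from_sv = (2.0 / (lam * len(y))) * ((slack * y)[support, None] * X[support]).sum(axis=0)
print(np.allclose(w, w_from_sv, atol=1e-3))       # True: only support vectors matter
print(int(support.sum()), "support vectors of", len(y), "points")
```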
The attention matrix is computed based on all tokens in the context, so it kind of functions non-parametrically (but over the batch instead of over the whole training dataset)
I would love an Andrej video on this
Fuck, imagine how many doctoral theses I could've written every time I tweaked a few lines of code to try some abstract way of recombining outputs I didn't fully understand. I missed the boat. All this jargon is absolutely for show, though. Purely intended to create the impression that there's some kind of moat to the "discovery". There are much clearer ways to express "we fucked around with putting the outputs of this black box back into the inputs", but I guess that doesn't impress the rubes.
I really really wish this culture of expressing simple things in ornate ways would die. All it does is make knowledge less accessible
I think this applies to everything right now. Papers like this are just ridiculous examples. In like, 6th grade I won second place at the LA county science fair for coding a simulation of a coyote's life in hypercard (with tons of graphs). Yay. Y'know what? That shit and those graphs would've been incomprehensible to the judges if it hadn't been written in plain language, in an attempt to make them understand what they were looking at. My entire career since has been an attempt to communicate and alleviate the pain points in communication between parties, by way of writing software that encapsulated their descriptions of what they needed. And likewise I never pretended to be smarter or know more than my clients did: Everything must be explained and comprehensible in normal people language. People need to know how shit works, especially if they're paying for it.
Or they should.
Or if they don't know and don't care, they're fucking negligent.
Especially if they say "wow that sounds smart, let's let these guys run our weapons program".
To your point, the reason this ornate language thrives and people get away with complacency about how their own systems work, boils down to a silent pact between managers and engineers to sweep everything under the rug out of laziness and ill-will. There's something blatantly mendacious and evil (in the banal way) about the agreement that managers approve black boxes which were approved by complex-sounding papers so that upper management can wash their hands of the results.
[edit] maybe I'm just bitter because I spent hours today pondering exactly how many engineers at Monsanto must have known about the dangers of the astroturf, and how many raised their hand, or hid behind a spreadsheet
https://frontofficesports.com/investigation-links-astroturf-...
“Engineers who like to pretend to be mathematicians” - I heard once.
What I find entertaining/confounding is how difficult the abstracts of these new AI papers are to understand. It feels like academia is pushing this style, so it's hard to blame the authors since they have to play the game.
For reference I have an undergrad degree in computer science, have been working professionally for 25 years, and am fairly data centric in my work.
I'm hoping that when I run this through GPT-4 to get an explanation for a mortal software developer, something sensible comes out the other end.
"Not math-y enough"/ "Needs more math" is a very common feedback ML/AI researchers get when writing papers.
The other day I was watching a live-stream of a doctoral defense, as the thesis was quite relevant to my work.
So one of the committee members would really pick and criticize the math - ask questions like "You are supposed to be the bleeding edge on this topic, why was the math so simple? Did you research more rigorous theories to explain the math?" etc. (He was awarded the doctorate though)
So, I dunno, if that's how things are now - it makes sense to me that the authors go overboard with complicated notation, even if they could have written it much simpler. Probably makes the work seem more rigorous and legit.
Doesn't really take that much more time, and it covers your ass from "not rigorous enough" gotchas - though at the expense of readability.
If the excuse is true and the "ornate" language really is a dense representation of information then it should be fairly trivial to have an LLM agent unsummarize it.
There could be a webservice that offers a parallel track of layman's translations of any paper.
That's literally the entire field of philosophy after the ~18th century.
I wish so also. When I was young and new, I wasted so much time trying to parse "arcane" math that was really something simple but dressed up as complicated to give it weight.
People have to get their PhD's somehow... ;)
Watching the AI community rediscover automatic differentiation 20+ years after the field was considered "mature" was equal parts frustrating & fascinating. The frustration was them rewriting the history of discovery, but without any sort of sense or rigor ... and it was also the most fascinating!
This is indeed the frustration
I'm waiting for some fresh group of grad students to make a breakthrough using a reinvented version of Pearl's "do" calculus, or maybe they'll make some narrow breakthrough using Bayes nets and everyone will geek out on those for a while
*I do think transformers (much like FF networks + backprop from 2012-2018) are probably a lasting software architecture for inference applications until we come up with new hardware and move beyond GPU-focused computing
It's exciting to see it all working, but disheartening how ahistorical these last few years have been in AI, with the exception of Brooks, Sutton, and a few other greybeards in the field who say similarly
The funny thing is that this constantly happens in every field ever, humans truly excel at repeating history without learning from the past.
Another example:
- HTML served by static file servers
- HTML generated by backend
- HTML enhanced with small JS snippets
- HTML generated by frontend, but served by backend
- Go to step one, not learning why anyone moved on from the previous method
I think the main motivation of ML theory that touches the current SOTA is not "expressing simple ideas with jargon for show". Jargon is necessary, however unnecessary it may seem to some (mostly very practical) engineers and software people who are used to expressing themselves quickly and practically. It is a jargon for the mathematics of machine learning, which is pretty unstandardized, so to speak, so you need to define your terms yourself. Without jargon and clear proofs, what you do is at most brainstorming. The value of such work is that its statements are clear, proved, and contain hypotheses which future papers can test.
Here is an example: to explain the existence of adversarial examples, there are two jargon-free suggestions: 1) that the decision boundary is too nonlinear, 2) that the decision boundary is too linear. These two explanations contradict each other and are stated without any real proof, yet both can unfortunately be heard widely across adversarial-example papers. If we had clear formulations of these two statements, we could have tested both claims, but unfortunately the papers that suggested these theories didn't put in the effort to define a jargon and state their suggestion as a clear, formal claim.
I studied using ML just over a decade ago. I actually compared MLPs to SVMs and had a similar thought to this. It does seem like there is a regression on understanding some of the fundamentals and older tools of the trade.
I guess everyone gets focused on the newer things.
Really does seem like people rediscovering older endpoints.
There's been a huge flood of vanilla software engineers into ML, retconning it as "a subfield of computer science" (computability is a minor concern compared to the statistical underpinnings). They pretend to know the math because they can read the equations, then claim with utmost confidence that actually they're doing all the hard work in ML because they are experts in calling APIs and integrating into products, however useful or useless.
That’s science. You can’t expect everyone to know everything. It’s a preprint so this is the first opportunity to provide feedback.
You think it's that easy, but, as frequently the case with transformers, I believe there's more here than meets the eye.
which jargon here is "just for show"?
> we show that over-parameterization catalyzes global convergence by ensuring the feasibility of the SVM problem and by guaranteeing a benign optimization landscape devoid of stationary points
does this mean 'an over-parameterized transformer problem is a convex svm problem'?
Or, optimistically, this is really how they think about these things, and you should simply be happy they're not trying to obfuscate their findings.
It’s a preprint on arXiv, not the Magna Carta.
What are you talking about? Did we read the same abstract?
You're not wrong. Applied ML articles are not worth reading.
I wouldn’t go this far, applied ML articles are my favorite articles. If you’re in the arena, it’s good to see things that other people have done from a practical perspective so you can ape it in your own work or not give it further consideration.
[strike]Punk's[/strike] SVM's not dead!
(More seriously, it's good to find inroads to better formal understanding of what's happening in these systems.)
Universal function approximator == universal function approximator
Universal approximation theorem: https://en.wikipedia.org/wiki/Universal_approximation_theore...
NAND is a universal logic gate; from which all classical functions can be approximated.
CCNOT and Hadamard are universal logic gates with which all (?) quantum functions/transforms can be approximated.
Cubic splines are universal function approximators. As is piecewise linear interpolation. The universal function approximator thing is way overblown.
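For instance, a quick numpy check of the piecewise-linear point (the knot count and target function are chosen arbitrarily):

```python
import numpy as np

# Piecewise linear interpolation as a function approximator: with enough
# knots, np.interp drives the error on a smooth target below any tolerance
# (classical bound: error <= h^2 / 8 * max|f''| for knot spacing h).
xs = np.linspace(0.0, 2.0 * np.pi, 100)   # 100 knots over one sine period
ys = np.sin(xs)

x_test = np.linspace(0.0, 2.0 * np.pi, 10_000)
err = np.max(np.abs(np.interp(x_test, xs, ys) - np.sin(x_test)))
print(err < 1e-3)  # True: roughly 5e-4 for h = 2*pi/99
```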
> CCNOT and Hadamard are universal logic gates with which all (?) quantum functions/transforms can be approximated.
CNOT, H, S, and T are universal for approximating any quantum operation.
All quantum functions can be approximated with a three-layer perceptron, or by arbitrary Boolean logic.
Quantum computers can only compute - just like any other computer.
Turing machines can also be used as universal function approximators. But I'm not sure it makes sense to put them in the same category as the other two.
I would love to put them in the same category as the other two. In fact I've spent quite a lot of time thinking about it / experimenting. Wouldn't it be great if we could somehow train on data and get a small Turing machine instead of a huge neural network?
Depends which valley it chooses to die on
Yup, and that's the big question for ML: "which valley will this algorithm die on?" So far I haven't seen an answer.
Transformers as voltage amplifiers.
Transformers as toy vehicles that can turn into robots.
And robots in disguise
my tldr: this explains
1) why huge models are important (so the gradient is high-dimensional enough to be monotonic)
2) why attention (aka connections, aka indirections) is trainable at all;
and says nothing about why they might generalize beyond the dataset
All machine learning is about finding hyperplanes.
This is wrong!
The term hyperplane already assumes that the hypothesis space that your learning algorithm searches has some kind of dimension and is some variant of a Euclidean / vector space (and its generalisations). This is not the case for many forms of ML, for example grammar induction (where the hypothesis space is Chomsky-style grammars), inductive logic programming (where the hypothesis space is Prolog, or similar, programs), or, more generally, program synthesis (where programs form the hypothesis space).
It can also just be some sort of partitioning. I would be really surprised if there was no partitioning of some space.
Hyperplanes is all you need
Someday I'm going to write a paper that achieves SOTA results with a nigh-incomprehensible mishmash of diverse techniques and title it "All You Need Considered Harmful".
Hyperplanation is all you need
The large dimensionality seems to be what creates the need for heuristic designs rather than a generic approach
Hyperplanes are the heuristics.
What is a hyperplane?
For 2D, a line; for 3D, a plane; for nD, a hyperplane.
https://en.wikipedia.org/wiki/Hyperplane
A hyperplane is a multi-dimensional linear function that splits space into two distinct regions. In the context of a classifier, it splits feature space into disjoint subspaces (one for each class). SVMs place a hyperplane with maximum margin, thereby separating classes in an optimal way.
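In code, the classifier side of this is just the sign of an affine function (the normal vector and offset below are arbitrary illustrative values):

```python
import numpy as np

# A hyperplane {x : w.x + b = 0} splits R^n into two half-spaces;
# a linear classifier just reads off which side a point falls on.
w = np.array([1.0, -2.0, 0.5])   # normal vector (illustrative)
b = 0.25                          # offset (illustrative)

def side(x):
    return np.sign(w @ x + b)     # +1 on one side, -1 on the other

print(side(np.array([1.0, 0.0, 0.0])))   # 1.0 + 0.25 = 1.25  -> +1
print(side(np.array([0.0, 1.0, 0.0])))   # -2.0 + 0.25 = -1.75 -> -1
```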
A subspace of dimension n-1 of an n-dimensional vector space. It extends the familiar concept of a 2D plane in 3D space to nD spaces.
a partition of Euclidean space into two convex sets ;)
it's a word (a made up word)
and multi-dimensional topological manifolds, maybe :)
You missed the best part where they think they're coming for our jobs.
I regret to inform you, I don’t think it’s the same set of people. Writing a cute “Transformers are SVMs” paper and “building chatGPT” are not the same skillset.
Did you read the abstract?
Is this an April joke?