Comment by regularfry
3 years ago
Practically speaking, does this give us anything interesting from an implementation perspective? My uneducated reading of this is that a single SVM layer is equivalent to the multiple steps in a transformer layer. I'm guessing it can't reduce the number of computations purely from an information theory argument, but doesn't it imply a radically simpler and easier to implement architecture?
I just read the abstract so could be way off, but this sounds more like one of those papers that connect seemingly different mathematical formalisms and show their equivalence (often under some restrictions). Typically they don’t give us much immediate benefit in terms of implementation, but they add to the intuitive understanding of what we’re doing, and sometimes help others make more practical progress.
I'm not an expert in this, so hopefully someone more knowledgeable can weigh in - but SVMs are understood much better from the perspective of overfitting and things like the VC bound, while Transformers are not understood nearly as well. From what I remember it's quite easy to have an SVM overfit, while Transformers have fewer issues. It'd be interesting to understand why.
So if the two are somehow connected, then that could have implications for tuning and fighting overfitting
maybe it'd also be possible to design better non-overfitting SVMs
> From what I remember it's quite easy to have a SVM overfit ... It'd be interesting to understand why
SVMs with well-tuned kernels and regularization are reasonably resistant to overfitting. The problem is that you can easily end up overfitting the hyperparameters if you're not very careful about how you do performance testing.
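To make the point about overfitting the hyperparameters concrete, here is a minimal sketch (assuming scikit-learn and a synthetic dataset; the parameter grids are arbitrary illustrative choices) of nested cross-validation, where the tuning loop is itself scored on held-out folds so the reported accuracy isn't inflated by the hyperparameter search:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

# Synthetic binary classification data for illustration only.
X, y = make_classification(n_samples=300, random_state=0)

# Inner loop: tune C and the RBF kernel width gamma.
inner = GridSearchCV(
    SVC(kernel="rbf"),
    {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]},
    cv=3,
)

# Outer loop: score the whole tuning procedure on folds it never saw.
# Tuning and reporting on the same folds is exactly how hyperparameters
# get overfit.
scores = cross_val_score(inner, X, y, cv=5)
print(round(scores.mean(), 3))
```

The single-loop shortcut (report the best grid-search score directly) is the trap the comment describes.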
Those equivalences can connect two different fields and allow methods to be transferred from one to the other. Each field usually develops quite a number of methods and tricks over time. So when this work shows that they are equivalent (with restrictions), you can maybe take some of the tricks from SVMs and use them to improve the Transformer model or its training.
Otherwise, they just help us in better understanding Transformers and SVMs.
There have been similar equivalences before, for example:
Linear Transformers Are Secretly Fast Weight Programmers, https://arxiv.org/abs/2102.11174
Or policy gradient methods from reinforcement learning are basically the same as sequence-discriminative training as it has been done in speech recognition for many years; however, they come with different tricks, and combining the tricks was helpful.
I am waiting for someone to publish the theoretical limits of these "AI" systems. They're certainly impressive language models - don't get me wrong on that. But every algorithm and every model has its limits. Knowing the limits turns their application from hype into engineering. And of course, the hype-sellers will try to keep that from happening as long as possible.
Hey,
https://en.wikipedia.org/wiki/Universal_approximation_theore...
This theorem explains the limits. Putting it in simple terms, most architectures are universal approximators that are constrained by the inductive bias we give them. So far the approximator architecture least constrained by its inductive bias is the transformer, so it should be able to approximate any mathematical function. The current problem is that the attention mechanism has quadratic scaling, so while it is easy to scale with text, it is pretty hard to scale to the same performance on anything else. Even if no further discoveries are made, just with the computing power of the future it should be able to scale in every field, and even with today's techniques it gives pretty good performance in a lot of tasks.
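A toy illustration (plain NumPy, not a real transformer) of why that scaling is quadratic: the attention score matrix has one entry per (query, key) pair, so a sequence of length n produces an n-by-n matrix, and doubling n quadruples the work:

```python
import numpy as np

def attention(q, k, v):
    # scores has shape (n, n): one entry per (query, key) pair,
    # which is the source of the quadratic cost in sequence length.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    # Numerically stable softmax over each row.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

n, d = 512, 64  # sequence length, head dimension (arbitrary toy sizes)
rng = np.random.default_rng(0)
q = rng.standard_normal((n, d))
k = rng.standard_normal((n, d))
v = rng.standard_normal((n, d))

out = attention(q, k, v)
# The intermediate score matrix holds n * n = 262144 entries at n=512;
# at n=1024 it would hold 1048576, four times as many.
```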
This review of the paper "An Image is Worth 16x16 Words" by Yannic Kilcher explains it better if you are interested.
https://youtu.be/TrdevFK_am4?t=1314
It’s entirely reasonable to desire boundaries between nothing and … the universal approximation theorem!
Hype sellers, despite being annoying and noisy, are not the reason why it's hard to figure out the theoretical limits.
To put it the form of a rhetorical question: many of these models are public, so why "wait" when you could do the research yourself?
> I am waiting for someone publishing the theoretical limits of these "AI" systems.
> To know the limits turns their application from hype into engineering.
It would be helpful to know how the models actually work under the hood.
But we made very good use of metals for thousands of years before we understood things like atoms, chemical bonds, lattices, etc.
Some engineering disciplines can be made up largely of empirical knowledge.
Engineering to me is "make the things we want out of the things we have", and not necessarily "design based on complete scientific theories".
I, as a Real Engineer, REFUSE to use ChatGPT until we have a working theory of quantum gravity. Enough of this bullshit where no one knows the fundamentals of what they’re working with.
What are the fundamental limits of language itself? Is English somehow more "emergent" than Korean? Isn't this more interesting than the actual execution mechanism?
The business of these new LLMs is next-token prediction with context. This is also now a mission because it clearly works to some large extent. Where most would not have been willing to take a leap of faith before, many can see a path now. I've been able to suspend my disbelief around language-as-computation long enough to discover new options.
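For a sense of what "next-token prediction with context" means at its most stripped-down, here is a toy bigram model (a hypothetical illustration, vastly simpler than an LLM, with a context of just one token) that predicts the most likely next word from counts:

```python
from collections import Counter, defaultdict

# Tiny training "corpus" for illustration only.
corpus = "the cat sat on the mat the cat ran".split()

# Count how often each token follows each other token.
counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1

def predict_next(token):
    # Greedy prediction: the most frequent follower seen in training.
    return counts[token].most_common(1)[0][0]

print(predict_next("the"))  # "cat" follows "the" twice, "mat" once
```

An LLM replaces the one-token context with thousands of tokens and the count table with a learned neural network, but the objective, predict the next token given the context, is the same shape.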
You're looking for the universal approximation theorem. It's one of those cases where they can do anything in theory, so the question is more whether we're chasing a Turing tarpit or not, where everything is possible but nothing is easy.