Comment by cubefox
14 days ago
I read parts of it years ago. As far as I remember, it's very theoretical (lots of statistical learning theory, including some IMHO mistaken treatment of Vapnik's theory of structural risk minimization), with a strong focus on theory and basically zero focus on applications. Any applications would be completely outdated by now anyway, as the book is from 2014, an eternity in AI.
I don't think many people will want to read it today. As far as I know, mathematical theories like SLT have been of little use for the invention of transformers or for explaining why neural networks don't overfit despite large VC dimension.
Edit: I think the title "From Theory to Algorithms" sums up what was wrong with this theory-first approach. Basically, people with an interest in math but no interest in software engineering got interested in ML and invented various abstract "learning theories", e.g. statistical learning theory (SLT), which had very little to do with what you can do in practice. Meanwhile, engineers ignored those theories and got their hands dirty on actual neural network implementations while trying to figure out how their performance could be improved, which led to things like CNNs and later transformers.
I remember Vapnik (the V in VC dimension) complaining in the preface to one of his books about the prevalent (alleged) extremism of focusing on practice only while ignoring all those beautiful math theories. As far as I know, it has turned out that these theories were just far too weak to capture the actual complexity of the approaches that do work in practice. Machine learning has clearly turned out to be a branch of engineering, not a branch of mathematics or theoretical computer science.
The title of this book encapsulates the mistaken hope that people would first learn those abstract learning theories, get inspired, and promptly invent new algorithms. But that's not what happened. SLT is barely able to model supervised learning, let alone reinforcement learning or self-supervised learning. As I mentioned, it can't even explain why neural networks are robust to overfitting. Other learning theories (like computational/algorithmic learning theory, or fantasy stuff like Solomonoff induction / Kolmogorov complexity) are even more detached from reality.
I watched a discussion the other day on this "NNs don't overfit" point. I realize that certain aspects are surprising, and that in many cases, with the right size and diversity of dataset, scaling laws prevail. But my experience and impression from training on real datasets from scratch (not fine-tuning pretrained models) has always been that NNs definitely can overfit if you don't have large quantities of data. My gut assumption is that the original theories were only shown to break down under certain circumstances (i.e. certain dataset characteristics), but that caveat is never mentioned in today's shorthand, where dataset sizes are often assumed to be huge.
(Before anyone laughs this off, this is still an actual problem in the real world for non-FAANG companies that have niche problems or cannot use open-but-non-commercial datasets. Not everything can be solved with foundation/frontier models.)
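If it helps, here's roughly what I mean, as a minimal sketch (scikit-learn on a small synthetic dataset; the numbers and hyperparameters are just illustrative, not from any real workload): an over-parameterized net fit on a couple hundred noisy samples will typically memorize the training set and do much worse on held-out data.

```python
# Minimal sketch: a small neural net overfitting a tiny synthetic dataset.
# Dataset and hyperparameters are illustrative assumptions only.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Only 200 noisy samples: the regime where overfitting bites.
X, y = make_classification(n_samples=200, n_features=50, n_informative=5,
                           flip_y=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0)

# A network with far more parameters than training samples.
net = MLPClassifier(hidden_layer_sizes=(256, 256), max_iter=2000,
                    random_state=0)
net.fit(X_train, y_train)

print("train accuracy:", net.score(X_train, y_train))  # typically close to 1.0
print("test accuracy: ", net.score(X_test, y_test))    # typically much lower
```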
Please point me to these papers because I'm still learning.
Yes, they can overfit. SLT assumed that overfitting is caused by large VC dimension, which apparently isn't the full story, because there are various techniques/hacks that effectively combat overfitting without actually reducing the very large VC dimension of those neural networks. Basically, the theory's bounds say these networks should overfit badly, while in reality they mostly work surprisingly well. That's often the case in ML engineering: people discover that some things work well and others don't, without being exactly sure why. The famous Chinchilla scaling law was an empirical discovery, not a theoretical prediction, because theories like SLT are far too weak to make interesting predictions like that. Engineering is basically decades ahead of those pure-theory learning theories.
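To make the "same capacity, less overfitting" point concrete, here's a minimal sketch along the same lines (again scikit-learn and a synthetic dataset, with arbitrary illustrative settings): two networks with an identical architecture, i.e. the same parameter count and the same crude capacity/VC-dimension proxy, where the only difference is the strength of weight decay. A purely capacity-based bound can't distinguish them, but the regularized one usually generalizes noticeably better.

```python
# Minimal sketch: identical architecture (same parameter count, same crude
# capacity/VC-dimension proxy), differing only in weight decay (L2 penalty).
# Dataset and hyperparameters are illustrative assumptions only.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=300, n_features=50, n_informative=5,
                           flip_y=0.1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

plain = MLPClassifier(hidden_layer_sizes=(256, 256), alpha=0.0,
                      max_iter=2000, random_state=0)
regularized = MLPClassifier(hidden_layer_sizes=(256, 256), alpha=1.0,
                            max_iter=2000, random_state=0)

for name, model in [("no weight decay", plain), ("weight decay", regularized)]:
    model.fit(X_tr, y_tr)
    print(f"{name:>16} | train: {model.score(X_tr, y_tr):.2f}"
          f" | test: {model.score(X_te, y_te):.2f}")
```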
> Please point me to these papers because I'm still learning.
Not sure which papers you have in mind. To be clear, I'm not an expert, just an interested layman. I just wanted to highlight the stark difference between the apparently failed pure-math approach I learned years ago in a college class and the actual ML papers released today, with major practical breakthroughs on a regular basis. Similarly practical papers were always available, just from very different people, e.g. LeCun or people at DeepMind, not from the theoretical computer science department people who wrote textbooks like the one here. Back in the day it wasn't very clear (to me) that the practice guys were really onto something while the theory guys were a dead end.
Theory is still needed if you want to understand things like variational inference (which is in turn needed to understand diffusion models). It's just like physics: you need the mathematical theory to understand something like quantum mechanics, because otherwise it doesn't make sense.
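For what it's worth, the core identity behind variational inference fits in a few lines. This is the standard ELBO decomposition for a generic latent-variable model with joint p(x, z) and approximate posterior q(z | x) (textbook material, not tied to any particular paper):

```latex
% Standard ELBO decomposition for a latent-variable model p(x, z)
% with approximate posterior q(z | x).
\log p(x)
  = \mathbb{E}_{q(z \mid x)}\!\left[\log \frac{p(x, z)}{q(z \mid x)}\right]
    + \mathrm{KL}\bigl(q(z \mid x) \,\|\, p(z \mid x)\bigr)
  \ge \underbrace{\mathbb{E}_{q(z \mid x)}\bigl[\log p(x \mid z)\bigr]
    - \mathrm{KL}\bigl(q(z \mid x) \,\|\, p(z)\bigr)}_{\text{ELBO}}
```

Since the KL term is nonnegative, maximizing the ELBO both fits the data and pulls q toward the true posterior; as far as I understand, diffusion model training objectives are derived by writing down exactly this kind of bound for a particular chain of latent variables.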
I think machine learning research is more like engineering, where you do need some math, but you don't need a physics degree. You don't need to understand everything first to discover that some engineering solutions work and others don't. And most abstract theories likely wouldn't have helped you anyway because they are not sufficiently concrete to apply to what you are doing in practice.
To make some progress in ML you might not need a lot of theory, but to understand why things work, you absolutely do. Moreover, the DL field as a whole desperately needs theories explaining what's going on in these large models.