
Comment by lamename

15 days ago

I watched a discussion the other day on this "NNs don't overfit" point. I realize that yes, certain aspects are surprising, and that with the right size and diversity of dataset scaling laws often prevail, but my experience with real datasets, training from scratch (not fine-tuning pretrained models), has always been that NNs definitely can overfit if you don't have large quantities of data. My gut assumption is that the original claims were never demonstrated to hold under certain circumstances (i.e. certain dataset characteristics), but that caveat never gets mentioned in the shorthand these days, when dataset size is usually assumed to be huge.

(Before anyone laughs this off, this is still an actual problem in the real world for non-FAANG companies that have niche problems or cannot use open-but-non-commercial datasets. Not everything can be solved with foundation/frontier models.)
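
For concreteness, here's the kind of toy sketch I have in mind. The setup below is purely hypothetical (a small noisy regression problem fit with scikit-learn's MLPRegressor, not data from any real project), but it shows the pattern I keep running into: near-perfect training fit, visibly worse held-out performance.

```python
# Toy illustration (hypothetical setup): an over-parameterized MLP memorizes
# a tiny noisy dataset -- near-perfect train score, noticeably worse test score.
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(80, 1))                  # only 80 samples
y = np.sin(X[:, 0]) + rng.normal(0.0, 0.3, size=80)   # noisy target

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

# Big network, no weight decay, trained until it essentially memorizes the training set.
net = MLPRegressor(hidden_layer_sizes=(512, 512), alpha=0.0,
                   max_iter=5000, tol=1e-7, random_state=0)
net.fit(X_tr, y_tr)

print("train R^2:", round(net.score(X_tr, y_tr), 3))  # typically close to 1.0
print("test  R^2:", round(net.score(X_te, y_te), 3))  # typically much lower
```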

Please point me to these papers because I'm still learning.

Yes, they can overfit. SLT (statistical learning theory) assumed that overfitting is driven by large VC dimension, which apparently isn't true, because there exist various techniques/hacks that effectively combat overfitting while not actually reducing the very large VC dimension of those neural networks. Basically, the theory predicts they should always overfit, while in reality they mostly work surprisingly well. That's often the case in ML engineering: people discover that some things work well and others don't, without being exactly sure why. The famous Chinchilla scaling law was an empirical discovery, not a theoretical prediction, because theories like SLT are far too weak to make interesting predictions like that. Engineering is basically decades ahead of those pure-theory learning theories.
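
As a minimal sketch of that point (toy data and scikit-learn, nothing taken from any actual paper): the two runs below use the identical architecture, so the hypothesis class, and hence its VC dimension, is unchanged; only the training recipe differs by adding L2 weight decay and early stopping, which typically shrinks the train/test gap.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Small, noisy classification problem (10% label noise).
X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           flip_y=0.1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

def run(alpha, early_stopping):
    # Identical architecture in both runs -> identical hypothesis class
    # (and hence identical VC dimension); only the training procedure changes.
    clf = MLPClassifier(hidden_layer_sizes=(256, 256), alpha=alpha,
                        early_stopping=early_stopping, max_iter=2000,
                        random_state=0)
    clf.fit(X_tr, y_tr)
    return clf.score(X_tr, y_tr), clf.score(X_te, y_te)

print("plain            (train, test):", run(alpha=0.0, early_stopping=False))
print("decay+early stop (train, test):", run(alpha=0.1, early_stopping=True))
```

And on Chinchilla: as I understand it, the published result is essentially a curve fit, a parametric loss of the form L(N, D) = E + A/N^α + B/D^β estimated from many training runs, rather than anything derived from an SLT-style bound.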

> Please point me to these papers because I'm still learning.

Not sure which papers you have in mind. To be clear, I'm not an expert, just an interested layman. I just wanted to highlight the stark difference between the apparently failed pure-math approach I learned years ago in a college class and the actual ML papers being released today, with major practical breakthroughs on a regular basis. Similarly practical papers were always available, just from very different people, e.g. LeCun or people at DeepMind, not from the theoretical computer science department people who wrote textbooks like the one here. Back in the day it wasn't very clear (to me) that the practice guys were really onto something while the theory guys were a dead end.