
Comment by refulgentis

17 hours ago

This is a beautifully written way of saying “Some parts of what the network memorizes affect test behavior, and some don’t.” But that’s not a theory of deep learning; a grand unified theory would have to explain why that split happens.

We're given a signal channel and a reservoir. Signal lives in the channel, noise lives in the reservoir, and the reservoir supposedly doesn’t show up at test time.

Okay, but then the obvious question: why would SGD put the right things in the right bucket?

If the answer is “because the reservoir is defined as the stuff that doesn’t transfer to test,” then this is close to circular.

The Borges/Lavoisier stuff is a tell. “We have unified the field” rhetoric should come after nontrivial predictions and results. Claiming to solve benign overfitting, double descent, grokking, implicit bias, estimating population risk from the training set, avoiding a validation set, and, last but not least, skipping training by analytically jumping to the end is worth 6 theory papers, 3 NeurIPS winners, and a $10B startup. Let's get some results before we tell everyone we unified the field. :) I hope you're right.

> why would SGD put the right things in the right bucket?

Think of it as a best-fit curve and exceptions to that curve. The noise is essentially the set of exceptions that move points away from where they would otherwise fall on the curve.

Gradient descent wants to be able to make the smallest change that moves the most data points towards the curve. To do this it learns an arrangement where it can change, say, one parameter and have a bunch of points move at once. What does this correspond to? The big common patterns shared by many data points.

Most of the capacity gets soaked up modelling these sorts of common patterns, and after they have been learned the model starts adding exceptions that allow individual points to deviate from the curve.

Because they’re exceptions, they must not impact neighbouring points, or at least only ones within a very short distance of them. Otherwise they’d be driving the error higher by impacting more points than they should. So you end up with very narrow ranges of features that are able to trigger different sorts of noise.

How narrow they are is shaped by the training data: they’re exactly as narrow as needed not to raise the error, so assuming the total population has the same distribution, they don’t get hit. Much.

At least, that’s what I take away from it.
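If you want to see that dynamic without any of the paper's machinery, plain gradient descent on an overparameterized random-feature regression already shows it. This is just a toy sketch of the intuition above, not the paper's eNTK setup; the sine curve, the outliers, and all the constants are made up for illustration:

    import numpy as np

    rng = np.random.default_rng(0)

    # A smooth common pattern plus a handful of "exception" points.
    n, n_out = 200, 5
    x = np.sort(rng.uniform(-3, 3, n))
    y = np.sin(x)
    out_idx = rng.choice(n, n_out, replace=False)
    y[out_idx] += rng.choice([-2.0, 2.0], n_out)   # large label noise on a few points
    clean = ~np.isin(np.arange(n), out_idx)

    # Overparameterized random-feature model, trained with plain gradient descent.
    d = 2000
    W = rng.normal(size=(1, d))
    b = rng.uniform(0, 2 * np.pi, d)
    phi = np.cos(x[:, None] * W + b) / np.sqrt(d)  # n x d feature matrix

    w = np.zeros(d)
    for step in range(1, 20001):
        resid = phi @ w - y
        w -= phi.T @ resid / n                     # learning rate 1.0
        if step in (100, 1000, 5000, 20000):
            print(f"step {step:6d}  "
                  f"ordinary-point MSE {np.mean(resid[clean] ** 2):.4f}  "
                  f"outlier MSE {np.mean(resid[out_idx] ** 2):.4f}")

The residual on the ordinary points collapses early, while the outliers are barely touched even after many more steps: fitting them needs narrow directions that move almost no other points, so they attract very little gradient. That ordering is the whole point of the intuition.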

Admittedly there’s probably some aggrandized boasting here, but I think empirical verification of that Adam modification alone would be a meaningful contribution, unless that's prior work?

  • A theory that skips the parameter space and explains grokking, yet arrives at an unexplained update rule, one that notably works at the per-parameter level by dropping the updates for most parameters.

    I suspect there is going to be a lot of handwaving to actually go from eNTK to that new update rule.

    I also doubt it helps in the non-grokking regime, given the focus of the theory, and that regime is where all the practical applications I have ever heard of live.

    Don't get me wrong, I did enjoy reading this essay. It's well written and reasonably argued without going into details.

    • The handwaving required is just to assume a diagonal preconditioner, and the optimal preconditioner under that constraint corresponds to the new update rule. (See section F of the paper.) And of course a diagonal preconditioner works at the per-parameter level.
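      To make the “diagonal” constraint concrete, here's a generic sketch (not the paper's actual rule; P, d, and the learning rate are just placeholders):

          import numpy as np

          def full_precond_step(theta, grad, P, lr=1e-2):
              # Full preconditioner: every parameter's update mixes every gradient entry.
              return theta - lr * (P @ grad)

          def diag_precond_step(theta, grad, d, lr=1e-2):
              # Diagonal preconditioner: an independent per-parameter rescaling,
              # theta_i <- theta_i - lr * d_i * g_i (Adam-style). Setting most
              # entries of d to zero is exactly "dropping the updates for most
              # parameters".
              return theta - lr * (d * grad)

      The derivation in section F then amounts to picking the d that is optimal under that diagonal restriction.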

If that's the case, a way to test the theory and understanding (assuming some parts of the reservoir and signal channel can be reliably identified) would be to prune the high-confidence reservoir, significantly reducing the model size while still getting good predictions. I don't believe the authors mention this (though I skimmed rather than read the full paper in detail, so I may be wrong).
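If some per-parameter notion of “how reservoir-like is this weight” could actually be extracted, the experiment would be roughly this (a sketch only; reservoir_score is hypothetical, nothing here comes from the paper):

    import torch

    def prune_reservoir(model, reservoir_score, prune_fraction=0.5):
        """Zero out the weights flagged as most reservoir-like, then re-run eval.

        reservoir_score: hypothetical dict mapping parameter names to tensors of
        the same shape, higher = more reservoir-like.
        """
        for name, p in model.named_parameters():
            score = reservoir_score[name].flatten()
            k = max(1, int((1.0 - prune_fraction) * p.numel()))
            thresh = score.kthvalue(k).values          # keep the k least reservoir-like weights
            mask = (reservoir_score[name] <= thresh).to(p.dtype)
            p.data.mul_(mask)
        return model

If the signal/reservoir split is real, test accuracy should barely move even at aggressive prune_fraction, which would be a far more convincing result than the framing essay.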

I don't know the math, but this point was clear to me, and it screamed "crank". I can't be sure of that, because I am not learned enough to understand the math... but even I could tell the magnitude of the claim. Even just removing the need for validation sets would have epic consequences across many fields.

These are the same complaints I had. It also felt like high-quality AI writing, possibly because of style choices like "Benign overfitting is noise sitting in the reservoir at interpolation. XYZ is ..." and because of how similar it is to the times I ended up with ChatGPT or Gemini producing very detailed and plausible reports about my own crackpot or vague-enough-to-be-useless ideas.

> The Borges/Lavoisier stuff is a tell.

Nah, the softer stuff seems like valuable outreach / good science communication for people who aren't up for the math, including probably lots of software engineers who are sick of dumb debates in forums and are starting to dip into the real literature and listen to better authorities. More people should do this, really, since it's the only way to see past the marketing and hype from fully entrenched AI boosters or detractors. Neither of those groups is big on critical thinking, and they dominate most of the conversation.

Time/effort coming from experts who want to make things accessible is a gift! The paper is linked elsewhere in the thread if you want the no-frills version.