Comment by andbberger
13 hours ago
https://en.wikipedia.org/wiki/Universal_approximation_theore...
the better question is why does gradient descent work for them
13 hours ago
https://en.wikipedia.org/wiki/Universal_approximation_theore...
the better question is why does gradient descent work for them
The properties that the uniform approximation theorem proves are not unique to neural networks.
Any models using an infinite dimensional Hilbert space, such as SVMs with RBF or polynomial kernels, Gaussian process regression, gradient boosted decision trees, etc. have the same property (though proven via a different theorem of course).
So the universal approximation theorem tells us nothing about why should expect neural networks to perform better than those models.
Extremely well said. Universal approximation is necessary but not sufficient for the performance we are seeing. The secret sauce is implicit regularization, which comes about analogously to enforcing compression.
@hodgehog11 The grokking phenomenon (Power et al. 2022) is a puzzle for the compression view: models trained on algorithmic tasks like modular arithmetic memorize training data first (near-zero training loss, near-random test accuracy) and then, after many more gradient steps, suddenly generalize. The transition happens long after any obvious compression pressure would have fired. Do you think grokking is consistent with implicit regularization as compression, or does it require a separate mechanism - something more like a phase transition in the weight norms or the Fourier frequency structure?
2 replies →
Whenever people bring this up I like to remind them that linear interpolation is a universal function approximator.
Can you expand on that?
1 reply →
Universal approximation is like saying that a problem is computable
sure, that gives some relief - but it says nothing in practice unlike f.e. which side of P/NP divide the problem is on
> unlike f.e. which side of P/NP divide the problem is on
Actually the P/NP divide is a similar case in my opinion. In practice a quadratic algorithm is sometimes unacceptably slow and an NP problem can be virtually solved. E.g. SAT problems are routinely solved at scale.
1 reply →
Interestingly, there exist problems which provably can't be learned via gradient descent for them.
I don't follow. Why wouldn't it work? It seems to me that a biased random walk down a gradient is about as universal as it gets. A bit like asking why walking uphill eventually results in you arriving at the top.
It wouldn't work if your landscape has more local minima than atoms in the known universe (which it does) and only some of them are good. Neural networks can easily fail, but there's a lot of things one can do to help ensure it works.
A funny thing is, in very high-dimensional space, like millions and billions of parameters, the chance that you'd get stuck in a local minima is extremely small. Think about it like this, to be stuck in a local minima in 2D, you only need 2 gradient components to be zero, in higher dimension, you'd need every single one of them, millions up millions of them, to be all zero. You'd only need 1 single gradient component to be non-zero and SGD can get you out of it. Now, SGD is a stochastic walk on that manifold, not entirely random, but rather noisy, the chance that you somehow walk into a local minima is very very low, unless that is a "really good" local minima, in a sense that it dominates all other local minimas in its neighborhood.
6 replies →
Not a mathematician so I’m immediately out of my depth here (and butchering terminology), but it seems, intuitively, like the presence of a massive amount of local minima wouldn’t really be relevant for gradient descent. A given local minimum would need to have a “well” at least be as large as your step size to reasonably capture your descent.
E.g. you could land perfectly on a local minima but you won’t stay the unless your step size was minute or the minima was quite substantial.
2 replies →