Comment by chadcmulligan

14 hours ago

"why do neural networks work better than other models?" That sounds really interesting - any references (for a non specialist)?

https://en.wikipedia.org/wiki/Universal_approximation_theore...

the better question is why does gradient descent work for them

  • The properties that the universal approximation theorem proves are not unique to neural networks.

    Any models using an infinite dimensional Hilbert space, such as SVMs with RBF or polynomial kernels, Gaussian process regression, gradient boosted decision trees, etc. have the same property (though proven via a different theorem of course).

    So the universal approximation theorem tells us nothing about why we should expect neural networks to perform better than those models.

    • Extremely well said. Universal approximation is necessary but not sufficient for the performance we are seeing. The secret sauce is implicit regularization, which comes about analogously to enforcing compression.

      3 replies →

    • Universal approximation is like saying that a problem is computable

      sure, that gives some relief - but it says nothing about practice, unlike e.g. which side of the P/NP divide the problem falls on

      2 replies →
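To make the parent comment's point concrete, here is a minimal sketch (assuming only NumPy; all parameter values are illustrative) of one of the non-neural models mentioned above - RBF kernel ridge regression - driving its approximation error on a smooth function toward zero. The same universal-approximation property the theorem gives to neural networks shows up here with no neural network in sight:

```python
import numpy as np

def rbf_kernel(a, b, gamma=10.0):
    # Gram matrix K[i, j] = exp(-gamma * (a_i - b_j)^2)
    return np.exp(-gamma * (a[:, None] - b[None, :]) ** 2)

rng = np.random.default_rng(0)
x_train = rng.uniform(-1, 1, 200)
y_train = np.sin(3 * x_train)

# Kernel ridge regression has a closed-form fit: alpha = (K + lam*I)^-1 y
K = rbf_kernel(x_train, x_train)
alpha = np.linalg.solve(K + 1e-6 * np.eye(len(x_train)), y_train)

# Evaluate the fitted function on a fresh grid
x_test = np.linspace(-1, 1, 100)
y_pred = rbf_kernel(x_test, x_train) @ alpha
max_err = np.max(np.abs(y_pred - np.sin(3 * x_test)))
print(max_err)  # small; shrinks further with more training points
```

Note there is no gradient descent here at all - the fit is a single linear solve - which is exactly why universal approximation by itself can't explain what's special about neural network training.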

  • Interestingly, there exist problems which provably can't be learned via gradient descent.

  • I don't follow. Why wouldn't it work? It seems to me that a biased random walk down a gradient is about as universal as it gets. A bit like asking why walking uphill eventually results in you arriving at the top.

    • It wouldn't work if your landscape has more local minima than atoms in the known universe (which it does) and only some of them are good. Neural networks can easily fail, but there are a lot of things one can do to help ensure it works.

      10 replies →
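A toy sketch of the local-minima point (pure Python, all numbers illustrative): plain gradient descent on a non-convex 1-D function with two minima of different quality. Which minimum you end up in depends entirely on where you start - a miniature version of why "walk downhill" is not obviously enough in a landscape with astronomically many minima:

```python
def f(x):
    # Non-convex: a good (global) minimum near x = -1.30
    # and a worse local minimum near x = +1.13.
    return x**4 - 3 * x**2 + x

def grad_f(x):
    return 4 * x**3 - 6 * x + 1

def descend(x, lr=0.01, steps=5000):
    # Plain gradient descent: repeatedly step downhill.
    for _ in range(steps):
        x -= lr * grad_f(x)
    return x

good = descend(-2.0)  # converges near x = -1.30, f = -3.51 (global minimum)
bad = descend(2.0)    # converges near x = +1.13, f = -1.07 (worse local minimum)
print(f(good), f(bad))
```

Both runs converge (the gradient goes to zero in each), but only one initialization finds the good minimum - and in millions of dimensions you can't simply try every basin.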