Comment by jmalicki

12 hours ago

The properties that the universal approximation theorem proves are not unique to neural networks.

Any model using an infinite-dimensional Hilbert space, such as an SVM with an RBF or polynomial kernel, Gaussian process regression, gradient-boosted decision trees, etc., has the same property (though proven via a different theorem, of course).

So the universal approximation theorem tells us nothing about why we should expect neural networks to perform better than those models.

Extremely well said. Universal approximation is necessary but not sufficient for the performance we are seeing. The secret sauce is implicit regularization, which acts analogously to enforcing compression.

  • @hodgehog11 The grokking phenomenon (Power et al. 2022) is a puzzle for the compression view: models trained on algorithmic tasks like modular arithmetic memorize training data first (near-zero training loss, near-random test accuracy) and then, after many more gradient steps, suddenly generalize. The transition happens long after any obvious compression pressure would have fired. Do you think grokking is consistent with implicit regularization as compression, or does it require a separate mechanism - something more like a phase transition in the weight norms or the Fourier frequency structure?

    • >Do you think grokking is consistent with implicit regularization as compression

      Pretty sure it's been shown that grokking depends on weight decay, which pushes model parameters towards zero. This can be viewed as compression in the sense of encoding the distribution in the fewest bits possible, which happens to correspond to better generalization.


Whenever people bring this up I like to remind them that linear interpolation is a universal function approximator.

  • Can you expand on that?

    • I'll use 1NN as the interpolation strategy instead since I think it illustrates the same point and saves a few characters.

      Recap: given a query Q, 1NN chooses from your learned "model" M (a finite set of (X, Y) pairs) the pair (X, Y) minimizing |Q - X|. Your output is Y.

      The following kind of argument works for linear interpolation too (you can even view 1NN as one-point interpolation), but it's slightly messier: definitions vary a fair bit, you potentially need to handle the existence of more than one discrete "nearest" or "enclosing" set of neighbors, and proving that you can get away with fewer points than 1NN, or achieve lower error than 1NN, is itself messier.

      Pick your favorite continuous function on a compact domain embedded in some Euclidean space. For any target error you'd like to hit, uniform continuity guarantees that if your samples cover the domain well enough (no point in the domain is farther than some fixed distance from some point in your model, with smaller distances needed for lower errors), then the maximum error of the 1NN strategy is bounded by the error that uniform continuity associates with that distance (which, again, you can make as small as you'd like by increasing the sampling resolution). Compactness of the domain means you can actually achieve those error bounds with finite sample sizes.

      For a simple example, imagine fitting more and more, smaller and smaller, line segments to y=x^2 on [-1,1].
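      A minimal numeric sketch of the argument above, assuming 1NN on denser and denser uniform grids over [-1, 1] with y = x^2 as the target:

```python
import numpy as np

def nn1_predict(q, xs, ys):
    """1-nearest-neighbor: return the y of the stored x closest to q."""
    return ys[np.argmin(np.abs(xs - q))]

f = lambda x: x ** 2  # continuous function on the compact domain [-1, 1]
queries = np.linspace(-1.0, 1.0, 1001)

prev_err = np.inf
for n in (10, 100, 1000):
    xs = np.linspace(-1.0, 1.0, n)  # denser and denser cover of the domain
    ys = f(xs)
    err = max(abs(nn1_predict(q, xs, ys) - f(q)) for q in queries)
    assert err <= prev_err  # max error shrinks as the cover tightens
    prev_err = err
    print(n, err)
```

      As the grid spacing halves, the worst-case error shrinks with it, exactly as the uniform-continuity bound predicts.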

Universal approximation is like saying that a problem is computable

sure, that gives some relief, but it says nothing in practice, unlike e.g. which side of the P/NP divide the problem is on

  • > unlike e.g. which side of the P/NP divide the problem is on

    Actually, the P/NP divide is a similar case in my opinion. In practice a quadratic algorithm is sometimes unacceptably slow, while an NP-hard problem can often be solved effectively; e.g. SAT instances are routinely solved at scale.
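    To make the SAT point concrete, here's a toy DPLL-style solver. It is worst-case exponential, like every known complete SAT procedure, yet dispatches easy instances instantly; industrial CDCL solvers such as MiniSat scale the same idea to millions of clauses.

```python
def dpll(clauses, assignment=None):
    """Tiny DPLL SAT solver. Clauses are lists of nonzero ints
    (DIMACS-style literals: 3 means x3, -3 means NOT x3).
    Returns a satisfying assignment dict, or None if unsatisfiable."""
    if assignment is None:
        assignment = {}
    # Simplify the clause set under the current partial assignment.
    simplified = []
    for clause in clauses:
        if any(assignment.get(abs(l)) == (l > 0) for l in clause):
            continue  # clause already satisfied by an assigned literal
        rest = [l for l in clause if abs(l) not in assignment]
        if not rest:
            return None  # clause falsified: backtrack
        simplified.append(rest)
    if not simplified:
        return assignment  # every clause satisfied
    # Branch on the first unassigned variable: try True, then False.
    var = abs(simplified[0][0])
    for value in (True, False):
        result = dpll(simplified, {**assignment, var: value})
        if result is not None:
            return result
    return None

# (x1 OR x2) AND (NOT x1 OR x3) AND (NOT x2 OR NOT x3)
model = dpll([[1, 2], [-1, 3], [-2, -3]])
print(model)
```

    This lacks unit propagation, clause learning, and restarts, so treat it only as a sketch of the search skeleton, not of how real solvers get their speed.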

    • An NP-hard problem's instances are not all worst-case instances.

      It's similar to the gap between pushdown automata and Turing machines. You can decide whether a pushdown automaton will terminate; you can't for Turing machines in general, but that doesn't stop you from running, on a Turing machine, an algorithm whose termination is decidable.