Comment by hodgehog11

15 hours ago

As someone who works in the area, this provides a decent summary of the most popular research items. The most useful and impressive part is the set of open problems at the end, which just about covers all of the main research directions in the field.

The skepticism I'm seeing in the comments really highlights how little of this work is trickling down to the public, which is very sad to see. While the field can offer few mathematical mechanisms for inferring optimal network design yet (mostly because trying things empirically is often faster than working through the theory, so insights tend to be inferred retroactively), the question "why do neural networks work better than other models?" is getting pretty close to a solid answer. The problem is that this was never the question people were really interested in, so the field now has to figure out which questions to ask next.

I'm constantly surprised how many people are critical of research into understanding neural nets, immediately telling me they are black boxes and hopeless to understand. I believe it's a consequence of neural networks being portrayed as the opposite of (classically interpretable) linear regression.

Many people additionally have little patience for research when the engineering is moving so quickly. Even many interpretability researchers give up far too soon if research doesn't yield immediately gratifying results.

We’re in a strange era where the information-theoretic foundations of deep learning are solidifying. The 'why' is largely solved: it’s the efficient minimization of irreversible information loss relative to the noise floor. There is so much waste in scaling models bigger and bigger when the math points to how to do it far more efficiently. One can take a great 70B model and run it in only ~16GB with no loss in capability and the ability to keep training, but for the last few years funding has only gone toward "bigger".
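That ~16GB figure is easy to sanity-check with back-of-the-envelope arithmetic, assuming plain per-parameter quantization and counting only raw weight storage (no activations, KV cache, or optimizer state):

```python
# Raw weight storage for a 70B-parameter model at various precisions.
# Pure arithmetic; no particular quantization scheme is assumed.
PARAMS = 70e9

def model_size_gb(bits_per_param):
    """Weight storage in GB (1 GB = 1e9 bytes), ignoring all overhead."""
    return PARAMS * bits_per_param / 8 / 1e9

for bits in (16, 8, 4, 2):
    print(f"{bits:>2}-bit: {model_size_gb(bits):6.1f} GB")
```

At 2 bits per parameter the weights alone come to ~17.5 GB, the ballpark the ~16GB claim implies; whether capability actually survives that compression is the contested part.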

As you noted, the industry has moved the goalposts to agency and long-horizon persistence. The transition from building 'calculators that predict' to 'systems that endure' is a non-equilibrium thermodynamics problem. There are formulas and basic laws at play here that apply to AI just as much as they apply to other systems. Ironically, it is the same math: the same thing that makes a signal persist in a model will make agents persist.

This is my specific niche: I study how things persist. It’s honestly a bit painful watching the AI field struggle to re-learn first principles that other disciplines have already learned. I have a doc I use to teach folks how the math works and how to apply it to their domain, and it's fun giving it to folks who then stop guessing and know exactly how to improve the persistence of whatever they are working on. The idea of "how many hours we can have a model work" is cute compared to the right questions.

  • > It’s honestly a bit painful watching the AI field struggle to re-learn first principles that other disciplines have already learned.

    This is my fear with software development in general. There's a hundred-year-old point of view right next door that'll solve my problems, and I'm too incurious to see it.

    I have a relative with a focus in math education that I've been stealing ideas from, and I think we'd both appreciate a look at your doc if you don't mind.

"why do neural networks work better than other models?" That sounds really interesting. Any references (for a non-specialist)?

  • https://en.wikipedia.org/wiki/Universal_approximation_theore...

    The better question is why gradient descent works for them.

    • The properties that the universal approximation theorem proves are not unique to neural networks.

      Any model using an infinite-dimensional Hilbert space, such as an SVM with an RBF or polynomial kernel, Gaussian process regression, or gradient-boosted decision trees, has the same property (though proven via a different theorem, of course).

      So the universal approximation theorem tells us nothing about why we should expect neural networks to perform better than those models.

      9 replies →

    • I don't follow. Why wouldn't it work? It seems to me that a biased random walk down a gradient is about as universal as it gets. A bit like asking why walking uphill eventually results in you arriving at the top.

      11 replies →
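The walking-uphill intuition above can be sketched in a few lines of illustrative Python (the objective and constants here are made up):

```python
def grad_descent(grad, x0, lr=0.1, steps=200):
    """Plain gradient descent: repeatedly take a small step downhill."""
    x = x0
    for _ in range(steps):
        x -= lr * grad(x)
    return x

# Convex bowl f(x) = (x - 3)^2 with gradient 2(x - 3): walking downhill
# from anywhere ends at the bottom, x = 3.
x_star = grad_descent(lambda x: 2 * (x - 3), x0=10.0)
```

On a convex bowl this provably converges; nothing in the loop explains why it also finds good minima on the wildly non-convex losses of deep networks, which is the actual puzzle.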

Do neural networks work better than other models? They can definitely model a wider class of problems than traditional ML models (images being the canonical example). However, I thought that where a like-for-like comparison is possible, they tend to do worse than gradient boosting.

  • Gradient boosting handles tabular data better than neural networks, often because the structure is simpler and dealing with the noise becomes more of an issue. You can do like-for-like comparisons between them on unstructured data like images, audio, video, and text, and a well-designed NN will mop the floor with gradient boosting. This is because handling that sort of data requires encoding some form of bias toward expected convolutional patterns in the data, or you won't get anywhere. Both CNNs and transformers do this.

    • Would you agree/disagree with the following:

      - It's not gradient boosting per se that's good on tabular data, it's trees. Other fitting methods with trees as the model are also usually superior to NNs on tabular data.

      - Trees are better on tabular data because they encode a useful inductive bias that NNs currently do not. Just like CNNs or ViTs are better on images because they encode spatial locality as an inductive bias.

      2 replies →
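The inductive bias discussed above (locality plus weight sharing) can be made concrete with a toy 1-D convolution; everything here is illustrative, no framework assumed:

```python
def conv1d(signal, kernel):
    """Slide one shared kernel across the signal: the same few weights are
    reused at every position (weight sharing), and each output sees only a
    small neighborhood of the input (locality)."""
    k = len(kernel)
    return [sum(kernel[j] * signal[i + j] for j in range(k))
            for i in range(len(signal) - k + 1)]

# A 3-tap edge detector fires wherever the step occurs, no matter where:
# translation equivariance falls out of the shared weights.
edge = conv1d([0, 0, 0, 1, 1, 1], [-1, 0, 1])  # [0, 1, 1, 0]
```

A tree ensemble gets no such structure for free; it would have to learn a separate split for every input position.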

In my opinion current research should focus on revisiting older concepts to figure out if they can be applied to transformers.

Transformers are superior "database" encodings, as the hype around LLMs shows, but there have been promising ML models focused on memory for their niche use cases. These could be promising concepts if we could make them work with attention matrices and/or apply the frequency-projection idea to their neuron weights.

The way RNNs evolved into LSTMs, GRUs, and eventually DNCs was pretty interesting to me. In my own implementations and use cases I wasn't able to reproduce DeepMind's claims for the memory-related parts of the DNC. Back then the "seeking heads" idea of attention matrices wasn't there yet; maybe there's a way to build better read/write/access gates now.

[1] a fairly good implementation I found: https://github.com/joergfranke/ADNC

> why do neural networks work better than other models

The only people for whom this is an open question are the academics; everyone else understands it's entirely because of the bagillions of parameters.

  • No it isn't, and it's frustrating when the "common wisdom" tries to boil it down to this. If that were true, then models with "infinitely many" parameters would be amazing. What about just training a gigantic two-layer network? There is a huge amount of work on engineering training procedures that work well.

    The actual reason is due to complex biases that arise from the interaction of network architectures and the optimizers and persist in the regime where data scales proportionally to model size. The multiscale nature of the data induces neural scaling laws that enable better performance than any other class of models can hope to achieve.

    • > The actual reason is due to complex biases that arise from the interaction of network architectures and the optimizers and persist in the regime where data scales proportionally to model size. The multiscale nature of the data induces neural scaling laws that enable better performance than any other class of models can hope to achieve.

      That’s a lot of words to say that, if you encode a class of things as numbers, there is a formula somewhere that can approximate an instance of that class. It works for linear regression and works just as well for neural networks. The key thing here is approximation.

      2 replies →
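For readers unfamiliar with the "neural scaling laws" mentioned above: empirically, loss tends to fall as a smooth power law in parameter count N. A toy Kaplan-style curve, with all constants invented purely for illustration:

```python
# Illustrative scaling law: loss(N) = L_inf + a * N**(-alpha).
# The constants below are made up; real ones are fitted from training runs.
def scaling_loss(n_params, a=10.0, alpha=0.076, l_inf=1.7):
    return l_inf + a * n_params ** -alpha

# Loss falls smoothly, but slowly, as parameter count grows,
# approaching the irreducible floor l_inf.
losses = [scaling_loss(n) for n in (1e6, 1e8, 1e10, 1e12)]
```

The striking empirical fact is that curves of this shape hold across many orders of magnitude, which is what makes "just scale it" a defensible strategy at all.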

  • Also the massive amount of human work that has gone into them, which wasn't done before.

    Data labeling is a pretty big industry in some countries, and I guess dropping 200 kilodollars on labeling is beyond the reach of most academics, even if they didn't care about the ethics of it.

  • Normally, more parameters lead to overfitting (like fitting a high-degree polynomial to a handful of points), but neural nets are for some reason not as susceptible to that and scale well with more parameters.

    That's been my understanding of the crux of the mystery.

    Would love to be corrected by someone more knowledgable though

    • This absolutely was the crux of the (first) mystery, and I would argue that "deep learning theory" really only took off once it recognized this. There are other mysteries too, like the feasibility of transfer learning, neural scaling laws, and, more recently, in-context learning.
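The polynomial-overfitting intuition above can be made concrete in pure Python: interpolate noisy samples exactly and the fit is perfect on the training points but unreliable off them, which is exactly the failure over-parameterized nets mysteriously avoid (ground truth, noise level, and test point are all made up for illustration):

```python
import random

random.seed(0)

def truth(x):
    return x * x  # made-up ground truth

def lagrange(points, x):
    """Evaluate the unique interpolating polynomial through `points` at x."""
    total = 0.0
    for i, (xi, yi) in enumerate(points):
        term = yi
        for j, (xj, _) in enumerate(points):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total

# 10 noisy samples -> a degree-9 polynomial hits every one of them exactly...
train = [(x / 3.0, truth(x / 3.0) + random.gauss(0, 0.2))
         for x in range(-5, 5)]
train_err = max(abs(lagrange(train, x) - y) for x, y in train)

# ...but just past the training range the interpolant is already unreliable.
test_err = abs(lagrange(train, 1.55) - truth(1.55))
```

The classical picture says more parameters means more of this pathology; the empirical surprise is that deep nets in the heavily over-parameterized regime generalize anyway.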