Comment by geokon

7 hours ago

normally more parameters leads to overfitting (like fitting a polynomial to points), but neural nets are for some reason not as susceptible to that and can scale well with more parameters.

Thats been my understanding of the crux of mystery.

Would love to be corrected by someone more knowledgable though

This absolutely was the crux of the (first) mystery, and I would argue that "deep learning theory" really only took off once it recognized this. There are other mysteries too, like the feasibility of transfer learning, neural scaling laws, and now more recently, in-context learning.