
Comment by prideout

17 hours ago

This is a fascinating mathematical framework, but the post title might be a bit of an overreach. I often wonder whether a "theory of deep learning" could exist: one that could be stated succinctly and that would predict (1) scaling laws and (2) the surprising reliability of gradient descent.

Note that I said "predict" not "describe". It feels like we're still in the era of Kepler, not Newton.

I dunno... gradient descent is only really reliable with a big bag of tricks. Good initialization is a starting point, but residual connections and batch/layer normalization go a very long way towards making it reliable.
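
To make "bag of tricks" concrete, here's a rough sketch (my own toy example in PyTorch, nothing from the post) of a pre-norm residual block: the skip connection plus layer norm is exactly the kind of reparameterization that keeps plain gradient descent well behaved at depth.

    import torch
    import torch.nn as nn

    class PreNormResidualBlock(nn.Module):
        # One "trick-laden" block: layer normalization + a residual skip.
        def __init__(self, dim: int, hidden: int):
            super().__init__()
            self.norm = nn.LayerNorm(dim)      # layer normalization
            self.ff = nn.Sequential(           # small feed-forward sub-network
                nn.Linear(dim, hidden),
                nn.GELU(),
                nn.Linear(hidden, dim),
            )

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # Residual connection: the block only has to learn a perturbation
            # of the identity, which gradient descent handles well at depth.
            return x + self.ff(self.norm(x))

    block = PreNormResidualBlock(dim=64, hidden=256)
    x = torch.randn(8, 64)
    print(block(x).shape)  # torch.Size([8, 64])

Stack a few dozen of these and plain SGD/Adam trains them without drama; strip out the norm and the skip connection and the same depth gets much touchier.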

  • I agree; this is the correct way to see it, IMO. Instead of designing better optimizers, we designed easier parameterizations to optimize. The surprising part is that these parameterizations exist in the first place.

    • Gradient descent is mathematically the most efficient optimization strategy (save for some special functions) in high dimensions. This goes so far that people nowadays even believe something like it has to be at work in the human brain [1], if only because any other way of updating the brain would be far too energy-inefficient. From that perspective, finding the right parameterization was all we ever needed to achieve AI.

      [1] https://physoc.onlinelibrary.wiley.com/doi/full/10.1113/JP28...


> but the post title might be a bit of an overreach.

Really? An essay that leads off with a Borges anecdote skews grandiose. Oh my, how unprecedented!