Comment by wizzard0

3 years ago

my tldr: this explains

1) why huge models are important (so that the gradient is high-dimensional enough to be monotonic)

2) why attention (aka connections, aka indirections) is trainable at all;

and it says nothing about why such models might generalize beyond the dataset