Comment by ActorNightly
10 hours ago
The whole reason for the first "AI Winter" was that people were trying to solve problems with small neural nets, and of course you run into problems during training where you can't get things to converge.
Once compute became more available, you could train bigger neural nets, and thus had more dimensionality (in the sense of layer sizes); that gave gradient descent more directions to move in during training, so things started happening with ML.
And all the architectures that you see today are basically simplifications of fully connected layers at maximum dimensionality. Any operation like attention, self-attention, or convolution can be unrolled into matrix multiplies.
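To make that concrete, here's a minimal NumPy sketch of single-head self-attention written as nothing but matrix multiplies plus an elementwise softmax (the names and shapes are illustrative, not from any particular library):

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Single-head self-attention as pure matrix multiplies plus a softmax.
    X is (seq_len, d_model); Wq, Wk, Wv are (d_model, d_head) projections."""
    Q = X @ Wq                                   # query projection
    K = X @ Wk                                   # key projection
    V = X @ Wv                                   # value projection
    scores = Q @ K.T / np.sqrt(K.shape[-1])      # scaled dot-product scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                           # weighted sum of values

# toy usage
rng = np.random.default_rng(0)
X = rng.standard_normal((4, 8))                  # 4 tokens, model dim 8
Wq, Wk, Wv = (rng.standard_normal((8, 8)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)              # shape (4, 8)
```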
I wouldn't be surprised if Google TPUs basically do this. It stands to reason that they are the most efficient because they don't move memory around: the matrix-multiply circuitry is hard-wired, which means the compiler has to lay out the data in the locations that get matrix-multiplied together, so the compiler probably does that unrolling under the hood.
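As a rough illustration of what that unrolling looks like for convolution (this is the textbook im2col trick, not a claim about what the XLA/TPU compiler actually emits):

```python
import numpy as np

def conv2d_as_matmul(image, kernel):
    """Unroll a 2D 'valid' convolution (ML-style cross-correlation) into a
    single matrix multiply via im2col. image is (H, W), kernel is (kh, kw)."""
    H, W = image.shape
    kh, kw = kernel.shape
    out_h, out_w = H - kh + 1, W - kw + 1
    # Gather every kh x kw patch into one row of a big matrix.
    patches = np.stack([
        image[i:i + kh, j:j + kw].ravel()
        for i in range(out_h)
        for j in range(out_w)
    ])                                            # (out_h*out_w, kh*kw)
    # One matmul against the flattened kernel does all the sliding-window work.
    return (patches @ kernel.ravel()).reshape(out_h, out_w)

# matches a direct sliding-window implementation
img = np.arange(25, dtype=float).reshape(5, 5)
k = np.array([[1.0, 0.0], [0.0, -1.0]])
print(conv2d_as_matmul(img, k))                   # (4, 4) output
```

Real compilers typically use tiled or implicit variants rather than materializing the whole patch matrix, but the principle is the same.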