Comment by smallmancontrov

4 months ago

This is the case with most clever neural architectures: in theory, you could always replace them with dense layers that would match or beat them given enough resources and training. But that's just it: efficiency matters (parameter count, training data, training time, FLOPs), and dense layers aren't as efficient, to put it mildly.

You have seen this play out on a small scale. But calculate the size of the dense layers needed to even theoretically replicate a big attention layer, or even a convolution, to say nothing of the data needed to train them without the help of the architecture's inductive bias, and you will see that the clever architectures are quite necessary at scale.
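To put a rough number on the convolution case, here is a back-of-the-envelope sketch. It assumes a stride-1, same-padded 2D convolution and a single dense layer that maps the flattened input feature map to the flattened output feature map. The function names and the 224x224, 3-to-64-channel, 3x3 example are my own illustrative choices, not anything from a specific model.

```python
def conv2d_params(c_in, c_out, k):
    """Weights plus biases of a k x k convolution (weights shared across positions)."""
    return k * k * c_in * c_out + c_out


def dense_equivalent_params(h, w, c_in, c_out):
    """Weights plus biases of one dense layer mapping a flattened
    h x w x c_in input to a flattened h x w x c_out output
    (assumes stride 1 and same padding, so spatial size is unchanged)."""
    n_in = h * w * c_in
    n_out = h * w * c_out
    return n_in * n_out + n_out


if __name__ == "__main__":
    # Hypothetical example sizes: 224x224 RGB input, 64 output channels, 3x3 kernel.
    conv = conv2d_params(c_in=3, c_out=64, k=3)
    dense = dense_equivalent_params(h=224, w=224, c_in=3, c_out=64)
    print(f"conv params:  {conv:,}")
    print(f"dense params: {dense:,}")
    print(f"ratio:        {dense / conv:.2e}")
```

Under those assumptions the convolution needs 1,792 parameters and the dense layer roughly 4.8e11, a gap of more than eight orders of magnitude, and that is before counting the data needed to learn translation equivariance from scratch. Attention is harder to compare this way because its mixing weights depend on the input, so a single fixed dense matrix cannot reproduce it at all.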
