Comment by eldenring

4 hours ago

This is a common way of thinking. In practice, this kind of design choice is more like optimizing FLOP allocation. Surely, with an infinite compute and parameter budget, you could build a better model out of more intensive operations.

Another thing to consider is that transformers are very general computers. You can encode many far more complex architectures in simpler, multi-layer transformers.
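To make the "general computer" point concrete, here is a rough sketch (my own toy construction, not anything from a specific library) of three attention heads with hand-set weights reproducing a width-3 1D convolution. The assumptions are loud ones: one-hot positional encodings, circular padding, weights set by hand rather than learned, and hypothetical function names `attention_conv` and `circular_conv`.

```python
import numpy as np

def softmax(scores):
    # Numerically stable softmax over the last axis.
    scores = scores - scores.max(axis=-1, keepdims=True)
    e = np.exp(scores)
    return e / e.sum(axis=-1, keepdims=True)

def attention_conv(x, kernel, beta=50.0):
    """Three attention heads, one per kernel tap, with hand-set weights.

    x:      (n, d_in) sequence of feature vectors
    kernel: (3, d_in, d_out) weights for offsets -1, 0, +1
    beta:   score scale; large values make the softmax nearly one-hot
    """
    n, _ = x.shape
    pos = np.eye(n)  # assumption: one-hot positional encodings
    out = 0
    for head, offset in enumerate((-1, 0, 1)):
        # Query projection sends position i to the code of (i + offset) mod n.
        w_q = beta * np.roll(np.eye(n), offset, axis=1)
        scores = (pos @ w_q) @ pos.T       # keys are the raw position codes
        attn = softmax(scores)             # ~ permutation matrix (a shift)
        out = out + attn @ x @ kernel[head]  # value/output projection per head
    return out

def circular_conv(x, kernel):
    # Reference: direct width-3 convolution with circular padding.
    out = 0
    for tap, offset in enumerate((-1, 0, 1)):
        out = out + np.roll(x, -offset, axis=0) @ kernel[tap]
    return out

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 4))
kernel = rng.normal(size=(3, 4, 5))
print(np.allclose(attention_conv(x, kernel), circular_conv(x, kernel)))
```

Each head just attends to one fixed relative position, and summing the heads gives the convolution. That only shows the weights exist, not that training would find them, which is where the FLOP allocation question comes back in.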