Comment by eldenring
4 hours ago
This is a common way of thinking. In practice, this kind of thing is more like optimizing FLOP allocation. Surely with an infinite compute and parameter budget you could build a better model out of more intensive operations.
Another thing to consider is that transformers are very general computers. You can encode a great many more complex architectures inside simpler, multi-layer transformers.
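As a toy illustration of the encoding point, here is a minimal sketch of one attention head with hand-set weights that emulates a fixed "shift by one position" operation, the kind of building block a convolution or recurrence would otherwise provide. Everything here is an assumption made for illustration: the function name `shift_attention`, the one-hot positional encodings, and the sharpness parameter `beta` are hypothetical, and the weights are set by hand rather than trained.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def one_hot(i, n):
    # One-hot positional encoding; an out-of-range i gives the zero vector.
    return [1.0 if k == i else 0.0 for k in range(n)]

def shift_attention(values, beta=50.0):
    """Single causal attention head, hand-constructed (not trained).

    Position i uses a query that matches the key of position i - 1,
    so with a large beta the softmax is nearly one-hot and the head
    copies values[i - 1] into position i.
    """
    n = len(values)
    out = []
    for i in range(n):
        # Query: scaled one-hot of the previous position (zero vector at i = 0).
        query = [beta * q for q in one_hot(i - 1, n)]
        # Causal mask: position i only scores positions 0..i.
        scores = []
        for j in range(i + 1):
            key = one_hot(j, n)
            scores.append(sum(a * b for a, b in zip(query, key)))
        weights = softmax(scores)
        out.append(sum(w * values[j] for j, w in enumerate(weights)))
    return out

values = [3.0, 1.0, 4.0, 1.0, 5.0]
shifted = shift_attention(values)
```

Each position i >= 1 ends up holding approximately `values[i - 1]`; position 0 has nothing earlier to attend to, so it keeps its own value. This only shows that one simple fixed operation fits inside attention. It says nothing about whether training would find such weights.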