Comment by ozb
2 years ago
Indeed, transformers are just another universal approximator. It doesn't matter exactly what a particular attention head does: whether it's operating as a continuous associative array, doing kernel smoothing, or simulating a higher-dimensional vector space that exhibits monosemanticity. What OP misses is that, beyond being universal, what matters is that the architecture is efficiently trainable, in particular in parallel on GPUs. That is what makes it better than LZ or any other universal approximator; all else is secondary. If you can make LZ (or anything else) work significantly more efficiently than transformers on GPUs, you can found the next OpenAI and be a billionaire.
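For anyone unsure what "continuous associative array" and "kernel smoothing" refer to, here is a minimal sketch in plain Python. It assumes a single query, unnormalized dot-product scores, and toy keys and values chosen only for illustration; the function names are hypothetical and not taken from any library. It shows that a softmax attention readout equals a Nadaraya-Watson kernel smoother with an exponential kernel, and that scaling the scores up makes it behave like an ordinary key lookup.

```python
import math

def softmax(scores):
    # Subtract the max for numerical stability before exponentiating.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attention(query, keys, values, scale=1.0):
    # Single-query attention: softmax-weighted average of the values.
    weights = softmax([scale * dot(query, k) for k in keys])
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]

def kernel_smoother(query, keys, values, scale=1.0):
    # Nadaraya-Watson estimate with kernel K(q, k) = exp(scale * q.k).
    kern = [math.exp(scale * dot(query, k)) for k in keys]
    total = sum(kern)
    dim = len(values[0])
    return [sum(kw * v[i] for kw, v in zip(kern, values)) / total for i in range(dim)]

# Toy data (assumed for illustration): three keys, one scalar value each.
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
values = [[10.0], [20.0], [30.0]]
query = [1.0, 0.0]

soft = attention(query, keys, values)             # blend of all three values
smooth = kernel_smoother(query, keys, values)     # same number, computed as a kernel average
sharp = attention(query, keys, values, scale=50)  # nearly a hard lookup of the matching key
```

With a large `scale` the weights collapse onto the best-matching key, which is the "associative array" reading; with a small one you get a smooth blend, which is the "kernel smoothing" reading. Either way it is the same handful of multiply-adds, and those map well onto GPUs.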