Comment by tliltocatl
10 days ago
The "bitter lesson" is extrapolating from ONE datapoint where we were extremely lucky with Dennart scaling. Sorry, the age of silicon magic is over. It might be back - at some point, but for now it's over.
The way things will scale is not limited to optimizing low-level hardware; it also includes brute-force investment in and construction of massive data centers, which is absolutely happening.
It also ignores quite a lot of neural network architecture development that happened in the meantime.
The transformer architecture IS the bitter lesson. It lets you scale your way up with more data and computational resources. It was only after the fact that people came up with bespoke algorithms that increase the efficiency of transformers through human ingenuity. It turns out a lot of what transformers do is completely unnecessary, like the V cache, for example, but that doesn't matter in practice. Everyone is training their model with V caches, because they can start training their bleeding-edge LLM today, not after doing some risky engineering on a novel architecture.
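For anyone unfamiliar with the cache being referred to: below is a minimal sketch, in plain NumPy with made-up dimensions and weights, of the key/value caching used in autoregressive transformer decoding. The "V cache" is the cached value tensor that grows by one row per generated token; this is an illustration of the mechanism, not anyone's production implementation.

```python
import numpy as np

# Minimal single-head attention decode step with a KV cache.
# Dimensions and weights are hypothetical; this only shows where
# the K and V caches sit during autoregressive decoding.
d_model = 8
rng = np.random.default_rng(0)
W_q, W_k, W_v = (rng.standard_normal((d_model, d_model)) for _ in range(3))

k_cache = np.zeros((0, d_model))  # keys of all previous tokens
v_cache = np.zeros((0, d_model))  # values of all previous tokens (the "V cache")

def decode_step(x, k_cache, v_cache):
    """Process one new token; x has shape (d_model,)."""
    q, k, v = x @ W_q, x @ W_k, x @ W_v
    k_cache = np.vstack([k_cache, k])        # each cache grows by one row per token
    v_cache = np.vstack([v_cache, v])
    scores = k_cache @ q / np.sqrt(d_model)  # attend over all cached keys
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    out = weights @ v_cache                  # weighted sum of cached values
    return out, k_cache, v_cache

# Generate a few tokens' worth of hidden states.
for t in range(4):
    x = rng.standard_normal(d_model)
    out, k_cache, v_cache = decode_step(x, k_cache, v_cache)
print(k_cache.shape, v_cache.shape)  # both caches now hold 4 rows
```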
The architectures before transformers were LSTM-based RNNs. They suck because they don't scale. Mamba is essentially the successor to RNNs, and its key benefit is that it can be trained in parallel (better compute scaling), yet Mamba models are still losing out to transformers because the ideal architecture for Mamba-based LLMs has not yet been discovered. Meanwhile, the performance hit of transformers is basically just a question of how many dollars you're willing to part with.
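To make the parallelism point concrete, here is a toy NumPy sketch (hypothetical shapes, weights reused purely for brevity) contrasting the sequential time dependency of an RNN with causal self-attention, which computes every position of a training sequence in one batch of matrix multiplies. Mamba gets comparable training parallelism via an associative scan, which is not shown here.

```python
import numpy as np

# Toy contrast: RNN training is sequential over time, while attention
# over a full sequence is a single batched computation.
T, d = 6, 8
rng = np.random.default_rng(1)
X = rng.standard_normal((T, d))          # one training sequence
W, U = rng.standard_normal((d, d)), rng.standard_normal((d, d))

# RNN: each hidden state depends on the previous one, so the time loop
# cannot be parallelized across positions during training.
h = np.zeros(d)
rnn_states = []
for t in range(T):
    h = np.tanh(W @ h + U @ X[t])
    rnn_states.append(h)

# Causal self-attention: every position is computed at once from
# full-sequence matrix multiplies, which is what lets transformers
# soak up more hardware per training step.
Q, K, V = X @ W, X @ U, X @ W
scores = Q @ K.T / np.sqrt(d)
scores += np.triu(np.full((T, T), -1e9), k=1)   # mask out future positions
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
attn_states = weights @ V                # all T positions in parallel

print(len(rnn_states), attn_states.shape)
```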