Comment by rstuart4133
1 month ago
The "bitter lesson" is self evidently true in one way as was a quantum jump in what AI's could do once we gave them enough compute. But as a "rule of AI" I think it's being over generalised, meaning it's being used to make predictions where it doesn't apply.
I don't see how the bitter lesson could not be true for the current crop of LLMs. They seem to have memorised just about everything mankind has written down and squished it into something of the order of 1TB. You can't do that without a lot of memory to recognise the common patterns and eliminate them. The underlying mechanism is nothing like zlib's deflate, but in terms of the memory you have to throw at it they are the same in this respect: the bigger the compression window, the better deflate does. When you are trying to recognise all the patterns in everything humans have written down, to a deep level (such as discovering that mathematical theorems are generally applicable), the memory window and/or compute you have to use must be correspondingly huge.
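To make the deflate analogy concrete, here's a minimal sketch using Python's standard zlib module (the data and the window sizes are purely illustrative, not anything measured from an LLM): repetition that falls outside the compression window is invisible to deflate, so the same data compresses dramatically better with a bigger window.

```python
import os
import zlib

# A 1 KiB block of incompressible bytes, repeated 64 times. The only
# structure in this data is the long-range repetition every 1024 bytes.
block = os.urandom(1024)
data = block * 64

def deflate_size(payload: bytes, wbits: int) -> int:
    """Deflate `payload` with a window of 2**wbits bytes and return the output size."""
    c = zlib.compressobj(level=9, wbits=wbits)
    return len(c.compress(payload) + c.flush())

# A 512-byte window (wbits=9) can't reach back to the previous copy of the
# block, so the 64 KiB barely shrinks; a 32 KiB window (wbits=15) can, and
# the output collapses to roughly one block plus back-references.
print("512 B window :", deflate_size(data, 9), "bytes")
print("32 KiB window:", deflate_size(data, 15), "bytes")
```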
That was also true, to a lesser extent, when DeepMind taught an AI to play Pong in 2013. It had 1M pixels arriving 24 times a second, and it had to learn to pick out the ball and bats in that sea of data. That clearly requires a lot of memory and compute, and those resources simply weren't available on a researcher's budget much before 2013.
Since 2013 we've asked our AIs to ingest larger and larger datasets using much the same techniques used in 2013 (but known long before that), and been enchanted with the results. The "bitter lesson" predicts you need correspondingly more compute and memory to compress those datasets. Is it really a lesson, or an engineering rule of thumb that only became apparent once we had enough compute to do anything useful with AI?
I'm not sure this rule of thumb has much applicability outside of this "let's compress enormous amounts of data, looking for deep structure" realm. That's because if we look at the neural networks in animals, most are quite small. A mosquito manages to find us for protein, find the right plant sap for food, find a mate, and find water with enough algae for its eggs, using data from vision, temperature sensors and smell, and it uses that data to drive wings, legs and god knows what else. It does all that with 100,000 neurons. That's not what a naive reading of "the bitter lesson" tells you it should take.
Granted, it may take an AI of enormous proportions to discover how to do it with 100,000 neurons. Nature did it by iteratively generating trillions upon trillions of these 100,000-neuron networks over millennia, and used a genetic algorithm to select the best at each step. If we have to do it that way it will be a very bitter lesson. The 10-fold increases in compute every few years that made us aware of the bitter lesson are ending. If the bitter lesson's prediction is that we have to rely on that trend continuing in order to build our mosquito emulation, then it's predicting it will take us centuries to build all the sorts of robots we need to do all the jobs we have.
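The generate-and-select loop described above is just a genetic algorithm. As a toy sketch only (the genome size, fitness function and hyperparameters below are invented for illustration; real neuro-evolution would score each genome by running it as a controller for something like our mosquito):

```python
import random

GENOME_SIZE = 32      # stand-in for a tiny controller's weights
POPULATION = 200
GENERATIONS = 300
MUTATION_STD = 0.1

def fitness(genome):
    # Made-up objective: how close the weights sit to an arbitrary target.
    # In a real setting this would be "how well the controller survives".
    return -sum((w - 0.5) ** 2 for w in genome)

def mutate(genome):
    return [w + random.gauss(0, MUTATION_STD) for w in genome]

# Iteratively generate networks, keep the fittest tenth, and refill the
# population with mutated copies of the survivors.
population = [[random.uniform(-1, 1) for _ in range(GENOME_SIZE)]
              for _ in range(POPULATION)]
for _ in range(GENERATIONS):
    population.sort(key=fitness, reverse=True)
    elite = population[: POPULATION // 10]
    population = elite + [mutate(random.choice(elite))
                          for _ in range(POPULATION - len(elite))]

print("best fitness after", GENERATIONS, "generations:", fitness(population[0]))
```

Even this toy version burns through POPULATION × GENERATIONS evaluations to tune 32 numbers, which is the point: doing it nature's way, by selecting over whole networks, is enormously expensive.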
But that prediction is looking unlikely, and we have an example. On one hand we have Tesla FSD, throwing more and more resources at conventional AI training, in the way the bitter lesson says you must in order to progress. On the other we have Waymo, using a more traditional approach. It's pretty clear which approach is failing and which is working - and it's not going the way the bitter lesson says it should.
> We have an example. On one hand we have Tesla FSD, throwing more and more resources at conventional AI training, in the way the bitter lesson says you must in order to progress. On the other we have Waymo, using a more traditional approach. It's pretty clear which approach is failing and which is working - and it's not going the way the bitter lesson says it should.
As I understand the article, it is going the way the bitter lesson predicts it would - the initial "more traditional" approach generates almost-workable solutions in the near term while the "bitter lesson" approach is unreliable in the near term.
Unless you think that FSD is already in the "far" term (i.e. already at the endgame), this is exactly what the article predicts happens in the near term.