Comment by aspenmartin
4 days ago
I really wish more people skeptical of AI capabilities would read about scaling laws -- Lilian is always marvelous at giving a deep overview of the technical side, but the whole point is this: there are scaling laws, and they have held and continue to hold. They have been a huge basis for predictions about AI capabilities for the past 5 years or so.
Why should the skeptics be reading it? The scaling laws show diminishing returns on more training data and larger models.
From the Kaplan scaling laws paper:
> We have observed consistent scalings of language model log-likelihood loss with non-embedding parameter count N, dataset size D, and optimized training computation C_min, as encapsulated in Equations (1.5) and (1.6). Conversely, we find very weak dependence on many architectural and optimization hyperparameters. Since scalings with N, D, C_min are power-laws, there are diminishing returns with increasing scale.
So the skeptics are right to be skeptical of LLMs being all you need for continued advancement in this space. It seems like the believers are the ones who need to learn about the scaling laws.
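To make the diminishing-returns point concrete, here is a minimal sketch of a pure power law of the form L(N) = (N_c / N)^alpha, the shape the quoted passage describes for parameter count. The function name and the constants `n_c` and `alpha` are assumptions picked for illustration; they are not values stated anywhere in this thread.

```python
def power_law_loss(n_params: float, n_c: float = 8.8e13, alpha: float = 0.076) -> float:
    """Loss under a pure power law L(N) = (N_c / N) ** alpha.

    n_c and alpha are assumed, illustrative constants.
    """
    return (n_c / n_params) ** alpha


# Five model sizes, each 10x larger than the last.
sizes = [1e8 * 10 ** k for k in range(5)]
losses = [power_law_loss(n) for n in sizes]

# Relative change per 10x step is constant: 10 ** -alpha.
ratios = [after / before for before, after in zip(losses, losses[1:])]

# Absolute improvement per 10x step keeps shrinking.
drops = [before - after for before, after in zip(losses, losses[1:])]

if __name__ == "__main__":
    for n, loss in zip(sizes, losses):
        print(f"N = {n:.0e}  loss = {loss:.3f}")
```

With this exponent, every 10x in parameters multiplies the loss by the same factor (about 0.84), so each step buys a smaller absolute drop than the one before it.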
And sitting right next to the data and compute terms in every cross-entropy loss equation is the entropy of the language, which is just a fixed constant. That puts a hard floor under what cross-entropy training can reach, and I never hear it come up!
Right, but that is context dependent: it drops with context length, depends on the tokenizer, etc. It doesn't end up being super relevant, even though the loss for real models is relatively large in absolute terms. That doesn't really matter -- all of the interesting stuff happens once you start getting closer and closer to the floor. You've gotten past all of the easy tokens that dominate the entropy, and now you get to the really challenging ones that we care about (e.g. very difficult reasoning about a next step).
My understanding is that the true entropy floor of a language is intractable: regardless of context length there will be “unpredictable” tokens where some cross-entropy loss is unavoidable. Even with infinite parameters and data, you’ll still fail to predict the next token correctly a decent chunk of the time.
Also, compute scales quadratically with context length because of attention, so depending on context growth means taking a bath on GPUs for as long as you can, right?
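As a rough sketch of the quadratic term mentioned above, the snippet below counts only the n x n part of standard dense self-attention for one layer. The function name, the model width of 4096, and the context lengths are hypothetical values chosen for illustration.

```python
def attention_score_flops(seq_len: int, d_model: int) -> int:
    """Approximate FLOPs for the n x n part of one dense self-attention layer.

    Counts Q @ K^T and the attention-weighted sum over V, at about
    2 * n^2 * d FLOPs each. Projections and the MLP, which grow
    linearly with seq_len, are deliberately left out.
    """
    return 4 * seq_len * seq_len * d_model


d_model = 4096                   # hypothetical model width
contexts = [2048, 4096, 8192]    # each context is double the previous one
flops = [attention_score_flops(n, d_model) for n in contexts]

# Growth factor of the quadratic term for each doubling of context.
growth = [after / before for before, after in zip(flops, flops[1:])]

if __name__ == "__main__":
    for n, f in zip(contexts, flops):
        print(f"context = {n:5d}  attention FLOPs per layer = {f:.3e}")
    print("growth per doubling:", growth)
```

Only this term is quadratic. The per-token projections and MLP grow linearly with context, so the total cost of a full model rises more slowly than 4x per doubling until the context is long relative to the model width.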
Right, but do you understand what happens at that limit? A model whose cross entropy on a stream of text sits at that limit produces text that is statistically indistinguishable from the stream, both in theory and in practice.
And so if the data stream has been produced by something intelligent, the resulting model is indistinguishable from that intelligence as far as the observed data goes. That is the whole compression idea behind artificial intelligence.
The limit is not a bug, it's a feature!
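The argument above rests on the identity H(p, q) = H(p) + KL(p || q): cross entropy equals the data entropy plus a divergence term that is zero only when the model distribution matches the data distribution. Here is a minimal sketch with made-up three-token distributions; all names and numbers are illustrative assumptions.

```python
import math


def entropy(p):
    """Shannon entropy of distribution p, in bits."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)


def cross_entropy(p, q):
    """Expected bits per token when data follows p and the model predicts q."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)


def kl_divergence(p, q):
    """Excess bits over the entropy floor: H(p, q) - H(p)."""
    return cross_entropy(p, q) - entropy(p)


true_dist = [0.5, 0.25, 0.25]      # hypothetical next-token distribution
imperfect_model = [0.4, 0.4, 0.2]  # hypothetical model that is slightly off
perfect_model = list(true_dist)    # model that matches the data exactly

if __name__ == "__main__":
    print("entropy floor:          ", entropy(true_dist))
    print("imperfect cross entropy:", cross_entropy(true_dist, imperfect_model))
    print("perfect cross entropy:  ", cross_entropy(true_dist, perfect_model))
```

The floor is nonzero, which matches the point about unpredictable tokens, and a model sitting exactly on it has zero divergence from the data distribution. Whether real training gets close to that point is a separate question that this toy example does not address.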