When I first saw scaling laws in that deep speech experiment notebook, I didn’t believe it could be real. I was worried for months that we made a mistake, or that it only worked for that one dataset.
I started to believe it after we (Joel Hestness in particular) reproduced it in so many experiments in “Deep Learning Scaling is Predictable, Empirically”.
The OpenAI work replicated it in a completely different environment, and at that point I was sure it was real.
Sometimes people ask me why I was so surprised by it. Prior work like Banko and Brill and “The Unreasonable Effectiveness of Data” argued for more data. ML theory had similar models for toy problems, e.g. coin flips.
At the time I thought deep learning was supposed to be complex. Speech and language datasets seemed much more complex than toy problems. Optimization of deep transformers was complex.
The idea that it was possible for the whole thing to be governed by a three-term equation seemed too simple. The implication was that it was simple to manufacture intelligence.
Ten years later, I still think it is the most interesting observation I have seen. We are still learning what it looks like to live in a world where it is possible to manufacture intelligence.
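For readers who have not seen the shape of such a law, here is a minimal sketch. It assumes one common three-parameter form, loss(D) = a * D**(-b) + c, which may not be the exact equation the comment has in mind. The function name and the values of a, b and c are made up for illustration and are not fitted to any real experiment.

```python
def power_law_loss(num_samples, a=10.0, b=0.3, c=1.5):
    """Hypothetical three-parameter learning curve: a * D**(-b) + c.

    a scales the curve, b is the power-law exponent, and c is the
    floor the loss approaches as the dataset grows.
    """
    return a * num_samples ** (-b) + c


# Evaluate the curve at dataset sizes spaced a factor of 10 apart.
sizes = [10 ** k for k in range(3, 10)]
losses = [power_law_loss(d) for d in sizes]

for d, loss in zip(sizes, losses):
    print(f"{d:>13,d} samples -> loss {loss:.3f}")
```

The loss keeps falling at every step but never drops below c, which is the whole curve described by three numbers.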
the scaling laws work within a "generation". but what about across them?
GPT-3 was 175B, models like Gemma4 with 31B vastly outperform it, so there is more to it
as Karpathy noted, the initial GPTs were trained on complete garbage (literally, the average document from the Common Crawl is random nonsense), yet they worked. now we can use present LLMs to curate the data for the next generation
I dunno if you've seen the subreddit, "Sub Simulator GPT2", but I found it around 2020-2021. It seemed to contain GPT2-style models trained/finetuned on several popular subreddits, talking to each other as stereotypical regulars of each sub would. Most of the replies were fairly coherent and somewhat related to the "thread topic", but of course even GPT3.5 would make all of them look beyond drunk only a few years later. I already had a vague understanding of neural networks and the advances in image processing at the time, but couldn't have predicted where we are now. I wonder what it'll look like in a few more years as we continue to learn how to make this capability useful and reliable, and hopefully keep finding additional conscionable entertainment and educational applications.
Scaling laws assume a fixed error metric and data distribution.
There is a lot of follow-on work that explains what happens as you change them, e.g. Scaling Laws for Transfer - https://arxiv.org/pdf/2102.01293
I think it’s fortunate that transfer works in a similar way.
Common Crawl (and Reddit, Stack Overflow, etc., but not 4chan) was much easier to get access to at the time than using Mechanical Turk.
There is certainly room for more work. There were many papers on scaling laws in NeurIPS this year.
Jeff Dean has a paper from 2007 that has proto scaling law plots for n-gram language models.
https://aclanthology.org/anthology-files/anthology-files/pdf...
I really wish more people skeptical of AI capabilities would read about scaling laws -- Lilian is always so marvelous at giving a deep overview of the technical side but the whole point of this is: there are scaling laws, and they hold and continue to hold. This is such a huge basis for the predictions about AI capabilities for the past like 5 years.
Why should the skeptics be reading it? The scaling laws show diminishing returns on more training data and larger models.
From the Kaplan scaling laws paper:
> We have observed consistent scalings of language model log-likelihood loss with non-embedding parameter count N, dataset size D, and optimized training computation Cmin, as encapsulated in Equations (1.5) and (1.6). Conversely, we find very weak dependence on many architectural and optimization hyperparameters. Since scalings with N,D,Cmin are power-laws, there are diminishing returns with increasing scale.
So the skeptics are right to be skeptical of LLMs being all you need for continued advancement in this space. It seems like the believers are the ones who need to learn about the scaling laws.
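To make "diminishing returns" concrete, here is a rough sketch of what a pure power law implies. It assumes loss proportional to N**(-alpha) and ignores any irreducible floor. The exponent 0.076 is used only as an illustrative value of the same order as the parameter-count exponent in the Kaplan paper, and the function names are hypothetical.

```python
def loss_ratio(alpha: float, scale_factor: float) -> float:
    """Factor by which loss is multiplied when N grows by scale_factor,
    assuming loss is proportional to N**(-alpha)."""
    return scale_factor ** (-alpha)


def scale_to_halve_loss(alpha: float) -> float:
    """How many times larger N must get for the loss to fall by half."""
    return 2.0 ** (1.0 / alpha)


ALPHA = 0.076  # illustrative exponent, an assumption for this sketch

print(f"10x more parameters multiplies loss by {loss_ratio(ALPHA, 10.0):.3f}")
print(f"halving the loss needs about {scale_to_halve_loss(ALPHA):,.0f}x more parameters")
```

Both readings in the thread fit this picture: every 10x step buys the same relative improvement, and each such step costs ten times more than the last.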
And sitting right next to the data and compute factors in every cross-entropy loss equation is the entropy of the language, which is just a fixed constant. There’s such a hard floor on cross-entropy loss training and I never hear it come up!
Right, but that is context dependent; it drops with context length, depends on tokenizer, etc. It doesn't end up being super relevant, despite the fact that if you look at the loss for real models it's relatively large in absolute terms. But that doesn't really matter -- all of the interesting stuff happens once you start getting closer and closer to it. You've gotten past all of the easy tokens that dominate the entropy and now you get to the really challenging ones that we care about (e.g. very difficult reasoning about a next step).
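The floor being discussed can be seen in a toy setting: the cross-entropy of any model against the true distribution is never lower than the entropy of the true distribution itself. The sketch below uses a made-up three-token "language"; the distributions and names are assumptions for illustration only.

```python
import math


def entropy(p):
    """Entropy of distribution p, in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)


def cross_entropy(p, q):
    """Expected loss, in nats, of model q when tokens are drawn from p."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)


true_dist = [0.7, 0.2, 0.1]       # hypothetical next-token distribution
rough_model = [1 / 3, 1 / 3, 1 / 3]  # knows nothing, predicts uniformly
better_model = [0.6, 0.25, 0.15]  # closer to the truth, still imperfect

floor = entropy(true_dist)
print(f"floor (entropy of the data): {floor:.4f}")
print(f"rough model loss:            {cross_entropy(true_dist, rough_model):.4f}")
print(f"better model loss:           {cross_entropy(true_dist, better_model):.4f}")
```

The better model's loss sits much closer to the floor than the uniform model's, and only a model that matches the true distribution exactly would reach it. In a real language model the floor also shifts with context length and tokenizer, as the reply points out.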