
Comment by nok22kon

4 hours ago

the scaling laws work within a "generation". but what about across them?

GPT-3 was 175B, yet models like Gemma4 with 31B vastly outperform it, so there is more to it than parameter count

as Karpathy noted, the initial GPTs were trained on complete garbage (literally, the average document from the Common Crawl is random nonsense), yet they worked. now we can use present LLMs to curate the data for the next generation

I dunno if you've seen the subreddit, "Sub Simulator GPT2", but I found it around 2020-2021. It seemed to contain GPT2-style models trained/finetuned on several popular subreddits, talking to each other as stereotypical regulars of each sub would. Most of the replies were fairly coherent and somewhat related to the "thread topic", but of course even GPT3.5 would make all of them look beyond drunk only a few years later. I already had a vague understanding of neural networks and the advances in image processing at the time, but couldn't have predicted where we are now. I wonder what it'll look like in a few more years as we continue to learn how to make this capability useful and reliable, and hopefully keep finding additional conscionable entertainment and educational applications along the way.

Scaling laws assume a fixed error metric and data distribution.

There is a lot of follow-up work that explains what happens as you change them, e.g. Scaling Laws for Transfer - https://arxiv.org/pdf/2102.01293

I think it’s fortunate that transfer works in a similar way.
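To make the "fixed metric and data distribution" point concrete, here is a minimal sketch that fits a power law of the form loss ≈ a * N^(-alpha) in log-log space. Everything in it is an assumption for illustration: the function names, the two "distributions", and all the numbers are made up, not taken from any paper or real model. It only shows that a fit holds for one data distribution, and that a smaller model on a better curve can come out ahead of a larger one on a worse curve.

```python
import math

def fit_power_law(sizes, losses):
    """Fit loss ~= a * size**(-alpha) via least squares on log-transformed data."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(l) for l in losses]
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(ys) / count
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # In log space: log(loss) = log(a) - alpha * log(size)
    return math.exp(intercept), -slope

def predict(a, alpha, size):
    """Extrapolate the fitted curve to a new model size."""
    return a * size ** (-alpha)

# Synthetic, invented data: same exponent, different constant per "distribution".
sizes = [1e6, 1e7, 1e8, 1e9]
noisy_losses = [6.0 * n ** -0.08 for n in sizes]    # hypothetical uncurated data
curated_losses = [5.0 * n ** -0.08 for n in sizes]  # hypothetical curated data

a_noisy, alpha_noisy = fit_power_law(sizes, noisy_losses)
a_cur, alpha_cur = fit_power_law(sizes, curated_losses)

# Each fit only describes its own curve; compare extrapolations across curves.
large_on_noisy = predict(a_noisy, alpha_noisy, 175e9)
small_on_curated = predict(a_cur, alpha_cur, 31e9)
print(f"175B on noisy curve:  {large_on_noisy:.3f}")
print(f"31B on curated curve: {small_on_curated:.3f}")
```

With these invented constants the smaller model on the curated curve ends up with the lower predicted loss, which is the kind of cross-"generation" shift a single fit cannot see.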

Common Crawl (and Reddit, Stack Overflow, etc., but not 4chan) was much easier to get access to at the time than using Mechanical Turk.

There is certainly room for more work. There were many papers on scaling laws at NeurIPS this year.