Comment by kouteiheika

1 month ago

> More computation runs the risk of overfitting and there just isn’t any more data.

At this scale you can't overfit. The model might not improve in a meaningful way, but it can't overfit in the classic sense because the amount of data is much, much larger than the size of the model.
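
For a rough sense of the scale argument, here's a back-of-the-envelope sketch; the corpus and model sizes are illustrative assumptions I picked, not numbers from this thread:

```python
# Back-of-the-envelope check of the "data is much bigger than the model" claim.
# Both numbers below are illustrative assumptions, not figures from this thread.
corpus_tokens = 15e12   # e.g. a FineWeb-scale web corpus, ~15T tokens
model_params = 70e9     # e.g. a 70B-parameter model

tokens_per_param = corpus_tokens / model_params
print(f"~{tokens_per_param:.0f} unique tokens per parameter")  # ~214

# At hundreds of unique tokens per parameter, typically seen for roughly one
# epoch, the model lacks the capacity to memorize the corpus, which is the
# mechanism behind classic overfitting.
```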

That said, as the state-of-the-art open models show, the way to get better models is not "use more data" but "use more high-quality data" (e.g. look at the graphs comparing datasets here: https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu). I don't know how much high-quality data we can extract from the Internet (most of it is, as you'd guess, garbage, which is why aggressively filtering it improves performance so much), but I'd wager we're still nowhere near running out.
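
To make the "filter aggressively" point concrete, here's a minimal sketch of threshold-based quality filtering; the scoring function is a toy heuristic I made up for illustration, whereas FineWeb-Edu actually uses a trained educational-quality classifier:

```python
# Minimal sketch of "aggressively filter the raw web and keep the good part".
# score_quality() is a toy placeholder, not FineWeb-Edu's real scoring model.

def score_quality(doc: str) -> float:
    """Toy proxy: longer, letter-heavy documents score higher."""
    if not doc:
        return 0.0
    letters = sum(c.isalpha() for c in doc)
    return min(len(doc), 2000) / 2000 * (letters / len(doc))

def filter_corpus(docs: list[str], threshold: float = 0.5) -> list[str]:
    """Keep only documents whose quality score clears the threshold."""
    return [d for d in docs if score_quality(d) >= threshold]

raw = [
    "click here to win $$$",
    "Gradient descent minimizes a loss by stepping against its gradient. " * 40,
]
kept = filter_corpus(raw)
print(f"kept {len(kept)} of {len(raw)} documents")
```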

There's also a ton of data that's not available on the public Internet in an easily scrapable form, or even at all; e.g. Anna's Archive apparently contains almost 1PB of data, and even that is (by their estimate) only ~5% of the world's books.

I'd think this only means that the model can't overfit on average. It may still have overfit on your specific problem.