
Comment by moomin

1 month ago

The thing is, what do we do with the bitter lesson once we’re essentially ingesting the entire internet? More computation runs the risk of overfitting and there just isn’t any more data. Is the bitter lesson here telling me that we’ve basically maxed out?

> More computation runs the risk of overfitting and there just isn’t any more data.

At this scale you can't overfit. The model might not improve in a meaningful way, but it can't overfit, because the amount of training data is much, much larger than the size of the model.
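
For a sense of scale, here's a rough back-of-envelope sketch in Python. The figures (~70B parameters, ~15T training tokens) are only illustrative ballpark numbers for a recent large open model, not something from the thread:

```python
# Back-of-envelope scale comparison; all numbers are rough, illustrative ballparks.
params = 70e9            # ~70B parameters
tokens = 15e12           # ~15T training tokens
bytes_per_param = 2      # bf16 weights
bytes_per_token = 4      # very rough average bytes of raw text per token

model_bytes = params * bytes_per_param   # ~140 GB of weights
data_bytes = tokens * bytes_per_token    # ~60 TB of raw text

print(f"tokens per parameter: {tokens / params:.0f}")             # ~214
print(f"raw text vs. weights: {data_bytes / model_bytes:.0f}x")   # ~430x
```

With the corpus hundreds of times larger than the weights, and each token typically seen only about once during training, there isn't much room to memorize the training set in the classic overfitting sense.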

That said, as the state-of-the-art open models show, the way to get better models is not "use more data" but "use more high-quality data" (e.g. look at the graphs comparing datasets here: https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu). I don't know how much high-quality data we can extract from the Internet (most of the data on the Internet is, as you'd guess, garbage, which is why aggressively filtering it improves performance so much), but I'd wager we're still nowhere near running out.
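
This is roughly the idea behind FineWeb-Edu: score every page with a quality classifier and keep only the pages that clear a threshold. Below is a minimal sketch of that kind of filter; `score_quality` and the 3.0 cutoff are hypothetical stand-ins for illustration, not the actual FineWeb-Edu pipeline:

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator

@dataclass
class Document:
    url: str
    text: str

def quality_filter(
    docs: Iterable[Document],
    score_quality: Callable[[str], float],  # hypothetical classifier, e.g. a 0-5 "educational value" score
    threshold: float = 3.0,                 # made-up cutoff for illustration
) -> Iterator[Document]:
    """Yield only documents whose quality score clears the threshold."""
    for doc in docs:
        if score_quality(doc.text) >= threshold:
            yield doc

# Toy usage with a trivial stand-in scorer; a real pipeline would use a trained classifier.
corpus = [
    Document("a.example", "A clear explanation of how gradient descent minimizes a loss."),
    Document("b.example", "BUY CHEAP WATCHES!!! click here click here click here"),
]
kept = list(quality_filter(corpus, lambda text: 5.0 if "explanation" in text else 0.0))
print([doc.url for doc in kept])  # ['a.example']
```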

There's also a ton of data that's not available on the public Internet in an easily scrapable form, or even at all; e.g. Anna's Archive apparently contains almost 1PB of data, and even that is (by their estimate) only ~5% of the world's books.

  • I'd think this only means that the model cannot suffer from overfitting on average. It might still be totally overfit on your specific problem.

We’re nowhere near ingesting the whole internet.

Though personally, I think we’re missing whatever architectural or mathematical breakthrough will make online learning (or even offline incremental learning, i.e. dreams) work.

At that point we could give the AI a robot body and train it on lived experience.

  • > "We’re nowhere near ingesting the whole internet."

    We don't need to ingest the whole internet. I'd wager that upwards of 75% of the internet is spam, which would be useless for LLM training purposes. By the way, the amount of spam and useless information on the internet is only going to get worse, largely thanks to LLMs.

    Only a subset of the internet contains "useful" information, an even smaller subset contains information that is "clean enough" to be used for training, and an even smaller subset can be legally scraped and used for training purposes.

    It's highly likely that we reached "peak training data" a long time ago, at least for many areas of knowledge and activity available on the internet.

With humans, while you can't read the whole internet, you can maybe read everything in a narrow niche. Then the thing is to go out and do something or make something. Maybe that's the future for AI.