Comment by csmpltn
1 month ago
> "We’re nowhere near ingesting the whole internet."
We don't need to ingest the whole internet. I'd wager that upwards of 75% of the internet is spam, which would be useless for LLM training purposes. By the way, the problem of spam and useless information on the internet is only going to get worse, largely thanks to LLMs.
Only a subset of the internet contains "useful" information, an even smaller subset contains information which is "clean enough" to be used for training purposes, and a smaller subset still can be legally scraped and used for training purposes.
It's highly likely that we reached "peak training data" a long time ago for many of the areas of knowledge and activity that are available on the internet.