Comment by xp84

4 days ago

Our conventional wisdom is that the LLMs are trained on "the whole Internet," but I hope that isn't true, and that it will be less and less true in the future. I fully agree with you that now, when I decide to search Google for something (instead of asking an LLM), I find at least 50% of the sites in the results are just AI-generated generalities of the quality you and GP describe -- useless both to me and to any future model being trained, since they're just repeating things the previous models already knew.

AI ought to be trained on known-to-be-good stuff. Anything trained by simply trawling the whole net will tend toward complete model collapse.