Comment by jrmg

11 hours ago

You can see the fall in real time: half the sources are also dubious AI slop now, and that number’s only growing :-/

At work the conversation is that everyone is using LLMs now, yet we simultaneously receive virtually no traffic through them. The LLMs scrape our data, provide an answer to the user, and we see nothing from it.

  • I have the same worry about LLMs in general - I know that ‘model collapse’ seems to be an unfashionable idea, but when the internet’s just full of garbage (soon?…), what are we going to train these things on?

    • They’ve moved away from raw text and now train on verifiable synthetic data (e.g., math, games, code) to improve general reasoning.

  • How often are they scraping?

    Also generally wondering… do labs view scraping as legally safer than trying to cache the Internet? I figure it’s easy to mark certain content as all but evergreen (with a quick secondary check for anything newly published).

    Maybe caching everything is too expensive?
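The “quick secondary check” above maps onto standard HTTP cache revalidation: instead of re-downloading evergreen pages, a crawler can store the validators (`ETag`, `Last-Modified`) from the last fetch and issue a conditional request; a `304 Not Modified` means the cached copy is still good. A minimal sketch of that logic, assuming a plain dict-based cache entry (the labs’ actual crawl pipelines are unknown):

```python
# Sketch of cache revalidation for a crawler: build conditional headers
# from validators saved at the last fetch, then update the cache entry
# depending on whether the server says 304 (unchanged) or 200 (changed).

def conditional_headers(cached: dict) -> dict:
    """Build request headers from validators saved at the last fetch."""
    headers = {}
    if cached.get("etag"):
        headers["If-None-Match"] = cached["etag"]
    if cached.get("last_modified"):
        headers["If-Modified-Since"] = cached["last_modified"]
    return headers

def update_cache(status: int, cached: dict,
                 fresh_body=None, fresh_etag=None) -> dict:
    """Apply the server's answer to the cached entry."""
    if status == 304:   # unchanged: keep cached body, nothing re-downloaded
        return cached
    if status == 200:   # changed: replace body and validators
        return {"body": fresh_body, "etag": fresh_etag,
                "last_modified": cached.get("last_modified")}
    return cached       # other statuses: leave the cache untouched

cached = {"body": "<html>…</html>", "etag": '"abc123"', "last_modified": None}
print(conditional_headers(cached))                    # {'If-None-Match': '"abc123"'}
print(update_cache(304, cached, None, None) is cached)  # True
```

On this model, “caching everything” is only expensive for pages whose servers don’t emit validators; everything else costs one cheap conditional request per revisit.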