
Comment by BryantD

19 days ago

I can understand why LLM companies might want to crawl those diffs -- it's context. Assuming that we've already trained LLMs on all the low-hanging fruit, building a training corpus that captures the way a piece of text changes over time probably has some value. This doesn't excuse the behavior, of course.

Back in the day, Google published the sitemap protocol to alleviate some crawling issues. But if I recall correctly, that was more about helping crawlers find more content, not about controlling the impact of crawlers on websites.

The sitemap protocol does have some features to help avoid unnecessary crawling: you can specify the last time each page was modified and roughly how often it's expected to change in the future, so crawlers can skip re-fetching pages when nothing has meaningfully changed.
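
For reference, a minimal sketch of what those two fields look like in a sitemap entry; the URL and date below are made up for illustration:

```python
# Build a tiny sitemap entry showing <lastmod> (when the page last changed)
# and <changefreq> (a hint about how often it's expected to change), the two
# fields that let a well-behaved crawler skip pages that haven't changed.
import xml.etree.ElementTree as ET

urlset = ET.Element("urlset", xmlns="http://www.sitemaps.org/schemas/sitemap/0.9")

url = ET.SubElement(urlset, "url")
ET.SubElement(url, "loc").text = "https://example.com/some-page"   # hypothetical URL
ET.SubElement(url, "lastmod").text = "2024-01-15"                  # last modification date
ET.SubElement(url, "changefreq").text = "monthly"                  # expected change frequency

print(ET.tostring(urlset, encoding="unicode"))
```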

It’s also for the web index they’re all building, I imagine. Lately I’ve been defaulting to web search via ChatGPT instead of Google, simply because Google can’t find anything anymore, while ChatGPT can even find discussions on GitHub issues that are relevant to me. The web is in a very, very weird place.