Comment by ethin

1 day ago

You do realize that these AI scrapers are most likely written by people who have no idea what they're doing, right? Or who just don't care? If they did know what they were doing, pretty much none of the problems these things have caused would exist. Even if we did standardize such a thing, I doubt they would follow it. After all, they think they and everyone else have infinite resources, so they can just hammer websites forever.

I realise you are making assertions for which you have no evidence. Until a standard exists, we can't just assume nobody will use it, particularly when it would make the very task they are scraping for simpler and more efficient.

  • > I realise you are making assertions for which you have no evidence.

    We do have evidence, which is their current behavior. If they are happy ignoring robots.txt (and also ignoring copyright law), what gives you the belief that they magically won't ignore this new standard? Sure, in theory it might save them money, but if there's one thing that seems blatantly obvious, it's that money isn't what these companies care about, because people just keep turning on the money generator. If they did care about it, they wouldn't be spending far more than they earn, and they wouldn't be creating circular economies to try to justify their existence. If my assertion has no evidence, I don't exactly see how yours does either, especially since we have seen that these companies will do anything if it means getting what they want.

  • A lot of the internet is built on trust. Mix in this article describing yet another tragedy of the commons, and you can see where this logically ends up.

    Unless we have some government enforcing the standard, another trust-based contract won't do much.

    • > A lot of the internet is built on trust.

      Yes. In this context, the problem is that you cannot trust websites to provide standardized bulk download options. Most of them have (often pretty selfish or user-abusive) reasons not to provide any bulk download at all, much less to proactively conform to some bottom-up standard. As a result, unless one is only targeting one or a few very specific sites, even thinking about making the scraper support anything but the standard crawling approach costs more in developer time than the benefit it brings.

  • Simpler and more efficient for whom? I imagine some random guy vibe coding "hi chatgpt I want to scrape this and that website", getting something running, then going to LinkedIn to brag about AI. Yes, I have no hard evidence for this, but I see things on LinkedIn.

    • That's not the problem being discussed here, though. That's normal usage, and you can hardly blame AI companies for the shitty scrapers random users create on demand, because it's merely a symptom of coding getting cheap. Or, more broadly, the flip side of the computer becoming an actual "bicycle for the mind" and empowering end-users for a change.