Comment by squigz

1 day ago

> To archive Metabrainz there is no way but to browse the pages slowly, page by page. There's no machine-communicable way that suggests an alternative.

Why does there have to be a "machine-communicable way"? If these developers cared about such things, they would spend 20 seconds looking at this page. It's literally one of the first links when you Google "metabrainz":

https://metabrainz.org/datasets

You expect the developers of a crawler to look at every site they crawl and develop a specialized crawler for each one? That’s fine if you’re only crawling a handful of sites, but absolutely insane if you’re crawling the entire web.

  • Isn't the point of AI that it's good at understanding content written for humans? Why can't the scrapers run the homepage through an LLM to detect the datasets page? (A rough sketch of this idea follows after the list.)

    I'm also not sure why we should be prioritizing the needs of scraper writers over human users and site operators.

  • if you are crawling the entire web, you should respect robots.txt and not fetch anything disallowed. full stop. (a robots.txt sketch also follows below.)
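
A rough sketch of the LLM idea above, assuming the `openai` Python client and an `OPENAI_API_KEY` in the environment; the model name and prompt are illustrative, not a tested recipe:

```python
# Rough sketch: fetch a homepage and ask a model whether the site
# advertises a bulk-data alternative to crawling it page by page.
# Assumes the `openai` package and OPENAI_API_KEY are available;
# the model name and prompt here are illustrative placeholders.
from urllib.request import urlopen

from openai import OpenAI

def find_bulk_data_link(homepage_url: str) -> str:
    html = urlopen(homepage_url).read().decode("utf-8", errors="replace")
    client = OpenAI()
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[{
            "role": "user",
            "content": (
                "Does this page link to bulk data dumps or an API meant "
                "for machine consumption? If so, reply with the URL; "
                "otherwise reply NONE.\n\n" + html[:20000]
            ),
        }],
    )
    return resp.choices[0].message.content
```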
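
And a minimal sketch of the robots.txt rule in the last bullet, using only Python's standard library; the user-agent string is a placeholder:

```python
# Minimal sketch: honor robots.txt before fetching anything.
from urllib import robotparser
from urllib.parse import urljoin
from urllib.request import Request, urlopen

USER_AGENT = "ExampleCrawler/1.0"  # placeholder identifier

def polite_fetch(base_url: str, path: str):
    """Fetch `path` from `base_url` only if robots.txt allows it."""
    rp = robotparser.RobotFileParser()
    rp.set_url(urljoin(base_url, "/robots.txt"))
    rp.read()  # download and parse the site's robots.txt
    url = urljoin(base_url, path)
    if not rp.can_fetch(USER_AGENT, url):
        return None  # disallowed by robots.txt: skip it, full stop
    req = Request(url, headers={"User-Agent": USER_AGENT})
    with urlopen(req) as resp:
        return resp.read()
```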