Comment by squigz

1 day ago

> To archive Metabrainz there is no way but to browse the pages slowly, page by page. There's no machine-communicable way that suggests an alternative.

Why does there have to be a "machine-communicable way"? If these developers cared about such things, they would spend 20 seconds looking at this page. It's literally one of the first links when you Google "metabrainz":

https://metabrainz.org/datasets

You expect the developers of a crawler to look at every site they crawl and develop a specialized crawler for each one? That’s fine if you’re only crawling a handful of sites, but absolutely insane if you’re crawling the entire web.

  • Isn't the point of AI that it's good at understanding content written for humans? Why can't the scrapers run the homepage through an LLM to detect the datasets page? (A rough sketch of this idea follows after the list.)

    I'm also not sure why we should be prioritizing the needs of scraper writers over human users and site operators.

  • if you are crawling the entire web, you should respect robots.txt and not fetch anything disallowed. full stop. (a robots.txt sketch also follows below.)
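
A rough sketch of the LLM idea above, assuming the `openai` Python client and an `OPENAI_API_KEY` in the environment; the model name and prompt are illustrative, not a tested recipe:

```python
# Rough sketch: fetch a homepage and ask a model whether the site
# advertises a bulk-data alternative to crawling it page by page.
# Assumes the `openai` package and OPENAI_API_KEY are available;
# the model name and prompt here are illustrative placeholders.
from urllib.request import urlopen

from openai import OpenAI

def find_bulk_data_link(homepage_url: str) -> str:
    html = urlopen(homepage_url).read().decode("utf-8", errors="replace")
    client = OpenAI()
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[{
            "role": "user",
            "content": (
                "Does this page link to bulk data dumps or an API meant "
                "for machine consumption? If so, reply with the URL; "
                "otherwise reply NONE.\n\n" + html[:20000]
            ),
        }],
    )
    return resp.choices[0].message.content
```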
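
And a minimal sketch of the robots.txt rule in the last bullet, using only Python's standard library; the user-agent string is a placeholder:

```python
# Minimal sketch: honor robots.txt before fetching anything.
from urllib import robotparser
from urllib.parse import urljoin
from urllib.request import Request, urlopen

USER_AGENT = "ExampleCrawler/1.0"  # placeholder identifier

def polite_fetch(base_url: str, path: str):
    """Fetch `path` from `base_url` only if robots.txt allows it."""
    rp = robotparser.RobotFileParser()
    rp.set_url(urljoin(base_url, "/robots.txt"))
    rp.read()  # download and parse the site's robots.txt
    url = urljoin(base_url, path)
    if not rp.can_fetch(USER_AGENT, url):
        return None  # disallowed by robots.txt: skip it, full stop
    req = Request(url, headers={"User-Agent": USER_AGENT})
    with urlopen(req) as resp:
        return resp.read()
```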