Comment by marginalia_nu
11 days ago
Well, there's Common Crawl, which is supposed to be that. Though ironically it's been under so much load from AI startups attempting to greedily gobble down its data that it was basically inaccessible the last time I tried to use it. Turtles all the way down, it seems.
There's probably a gap in the market for something like this. Crawling is a bit of a hassle, and being able to outsource it would help a lot of companies. Not sure if there's enough of a market to make a business out of it, but there's certainly a need for competent crawling and access to web data that seemingly isn't being met.
Common Crawl is great, but it only updates monthly and doesn’t do transformations. It’s good for seeding a search engine index initially, but wouldn’t be suitable for ongoing use. But it’s generally the kind of thing I’m talking about, yeah.