Comment by Nextgrid
14 hours ago
Running the bot nowadays is hard, because a lot of sites will now block you - not just by asking nicely via robots.txt, but by checking your actual source IP. Once they see it's not Google, they send you a 403.
14 hours ago
Running the bot nowadays is hard, because a lot of sites will now block you - not just by asking nicely via robots.txt, but by checking your actual source IP. Once they see it's not Google, they send you a 403.
Cloudflare’s ubiquity makes bootstrapping a search index via crawler virtually impossible, but what about data sources like Common Crawl?