Ask HN: Scaling a targeted web crawler beyond 500M pages/day
3 days ago
I've been reading up on crawler architecture. The two most useful sources I've found are the blog post "Crawling a billion web pages in just over 24 hours, in 2025" and the Mercator paper ("Mercator: A Scalable, Extensible Web Crawler").
Both of these, and most other material I've come across, focus on crawling the broad open web rather than a targeted set of domains. For product prices, it's the latter that matters. Mercator calls out DNS resolution as a major bottleneck, for example, but when you're only hitting a few hundred domains that isn't really a concern.
The other gap is that both assume static HTML. For our use case we need a headless browser, and we also have to deal with Cloudflare and similar anti-bot systems.
For product prices specifically, a lot of sites publish price feeds which simplifies things, but plenty don't, and getting good coverage still requires scraping. Our current system does about 500M pages/day and we're looking to improve its performance.
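For context, a quick back-of-envelope on what 500M pages/day means in sustained throughput. (The 300-domain figure is an assumption for illustration, not from the post.)

```python
# Back-of-envelope throughput for 500M pages/day (illustrative numbers only).
PAGES_PER_DAY = 500_000_000
SECONDS_PER_DAY = 86_400

pages_per_second = PAGES_PER_DAY / SECONDS_PER_DAY
print(f"{pages_per_second:,.0f} pages/sec sustained")  # ~5,787 pages/sec

# Spread over a few hundred domains, the per-domain load is still heavy:
DOMAINS = 300  # assumed count of target domains
print(f"~{pages_per_second / DOMAINS:,.0f} pages/sec per domain")
```

At that rate, per-domain politeness limits and anti-bot pressure matter far more than the DNS or frontier-management bottlenecks the broad-crawl literature focuses on.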
Does anyone here have experience in this space, or know of articles/blog posts on scaling targeted (rather than broad) crawlers with headless browsers? Any pointers appreciated.
If you want to access data from websites that try to prevent it, you've got to use a headless browser with a residential proxy network like Bright Data (formerly Luminati).
Our industry's understanding of consent is terrifying
It’s called hacker news, bro
I'm curious, how do you deal with Cloudflare and similar anti-bot systems? Just keep shopping the job around to different proxies?
It's fairly simple: you use browser profiles and visit multiple websites like a normal user over a residential proxy network, and Cloudflare can't detect you that way.
The older your browser profile is, the less often Cloudflare bans it.
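A minimal sketch of the "prefer older profiles" idea, assuming you track when each profile was first used. The profile structure and field names here are made up for illustration; weighting by age keeps seasoned profiles in heavy rotation while still warming up fresh ones.

```python
import random
import time
from dataclasses import dataclass

@dataclass
class BrowserProfile:
    name: str               # hypothetical profile identifier
    created_at: float       # unix timestamp when the profile was first used
    proxy: str = ""         # residential proxy endpoint pinned to this profile

def pick_profile(profiles: list[BrowserProfile]) -> BrowserProfile:
    """Weight each profile by its age in seconds, so older profiles
    are chosen far more often but new ones still get occasional use."""
    now = time.time()
    weights = [max(now - p.created_at, 1.0) for p in profiles]
    return random.choices(profiles, weights=weights, k=1)[0]
```

Pinning each profile to one residential proxy (rather than rotating proxies under a single profile) keeps the fingerprint-to-IP pairing consistent, which is part of looking like a normal user.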
Cloudflare reads this forum. By answering your question here, they burn that workaround. Why would someone do that? (No one bring up Warframe)
Have you already incorporated Common Crawl into your index?
Common Crawl is a sample of the web, so it's not directly helpful for someone building a product price dataset.