Comment by plastic041
5 days ago
> Avoid detection with built-in anti-bot patches and proxy configuration for reliable web scraping.
And it doesn't care about robots.txt.
Good point. The anti-bot patches here (via Patchright) are about preventing the browser from being detected as automated — things like CDP leak fixes so Cloudflare doesn't block you mid-session. It's not about bypassing access restrictions.
Our main use case is retail price monitoring — comparing publicly listed product prices across e-commerce sites, which is pretty standard in the industry. But fair point, we should make that clearer in the README.
robots.txt is the most basic access restriction, and it doesn't even read it while passing itself off as human[0]. It is about bypassing access restrictions.
[0]: https://github.com/lightfeed/extractor/blob/d11060269e65459e...
Regardless, you should still respect robots.txt.
We do respect robots.txt in production; scraping-browser providers like BrightData also enforce that.
I will add a PR to enforce robots.txt before the actual scraping.
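For reference, here's a minimal sketch of what such an enforcement check could look like, using Python's standard-library `urllib.robotparser`. The helper name `is_allowed`, the user-agent string, and the stand-alone shape (parsing a robots.txt body directly rather than fetching it) are illustrative assumptions, not code from lightfeed/extractor:

```python
from urllib import robotparser

def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt rules permit user_agent to fetch url.

    A real implementation would fetch https://<host>/robots.txt first
    (e.g. via RobotFileParser.set_url() + read()) and cache it per host;
    here the rules are passed in directly to keep the sketch self-contained.
    """
    parser = robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

rules = "User-agent: *\nDisallow: /private/\n"
is_allowed(rules, "pricebot", "https://example.com/products")   # allowed
is_allowed(rules, "pricebot", "https://example.com/private/x")  # disallowed
```

The scraper would call a check like this before every fetch and skip (or error on) any URL the rules disallow.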
> It's not about bypassing access restrictions.
Yes. It is. You've just made an arbitrary choice not to define it as such.
I will add a PR to enforce robots.txt before the actual scraping.