Comment by wraptile

6 months ago

It's still pretty hard to bypass Cloudflare (CF) with open-source solutions. To bypass it you need:

- an automated browser that doesn't leak the fact it's being automated

- the ability to fake the browser fingerprint (e.g. a Linux fingerprint is heavily penalized)

- residential or mobile proxies (at small scale your home IP is probably good enough)

- a deployment environment that isn't leaked to the browser

- a realistic scrape pattern and header configuration (header order, referer, pre-walking some pages with cookies, etc.)
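
To make the last point concrete, here is a minimal sketch of what "header order, referer, prewalk" could look like as a plan, before any request is sent. Everything in it is an assumption for illustration: the header order and values are placeholders (copy the real ones from your own browser's dev tools), and `build_headers` and `plan_visits` are hypothetical helper names, not part of any library.

```python
import random
from typing import Dict, List, Optional, Tuple

# Illustrative header order only (assumption). Capture the real order and
# values from the browser you are imitating.
HEADER_ORDER = [
    "upgrade-insecure-requests",
    "user-agent",
    "accept",
    "referer",
    "accept-encoding",
    "accept-language",
]


def build_headers(user_agent: str, referer: Optional[str] = None) -> Dict[str, str]:
    """Return headers in a fixed, browser-like order (dicts keep insertion order)."""
    values = {
        "user-agent": user_agent,
        "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "accept-language": "en-US,en;q=0.9",
        "accept-encoding": "gzip, deflate, br",
        "upgrade-insecure-requests": "1",
    }
    if referer:
        values["referer"] = referer
    # Emit only the headers we have values for, in the chosen order.
    return {name: values[name] for name in HEADER_ORDER if name in values}


def plan_visits(
    target: str,
    prewalk: List[str],
    min_delay: float = 2.0,
    max_delay: float = 8.0,
    rng: Optional[random.Random] = None,
) -> List[Tuple[str, Optional[str], float]]:
    """Plan (url, referer, delay_seconds) steps: pre-walk pages first, target last."""
    rng = rng or random.Random()
    plan = []
    referer = None
    for url in [*prewalk, target]:
        plan.append((url, referer, rng.uniform(min_delay, max_delay)))
        referer = url  # the next page looks like it was reached from this one
    return plan
```

You would then feed each step to whatever client you use, keeping one cookie jar for the whole walk. Note that some HTTP clients add or reorder headers on their own, so check what actually goes over the wire.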

This is really hard to do at scale. For small personal scripts you can get reasonable results with flavor-of-the-month stealth automation projects on GitHub (Playwright forks, nodriver and the like) or dedicated tools like FlareSolverr. Still, I'd just find a web scraping API with a low entry price, pay $15 a month and avoid this chase, because it can be really time-consuming.

If you're really on a budget: most of them offer 1,000 free credits, which gets you about 100 pages a month per service on average, and you can sign up for 10 of them since they mostly work the same way.
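
The free-tier math, as a quick sketch. The 10 credits per page figure is only inferred from the 1,000 credits and 100 pages quoted above, not a published price, and `free_tier_pages` is a made-up helper name.

```python
def free_tier_pages(credits_per_service: int, credits_per_page: int, services: int) -> int:
    """Pages per month across several free tiers, using whole pages per service."""
    return (credits_per_service // credits_per_page) * services


# 1,000 free credits at an assumed 10 credits per page, across 10 services.
print(free_tier_pages(1000, 10, 10))  # prints 1000
```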

DanielHB 6 months ago

I do it maybe once a month to fetch <1000 URLs, from my home PC on my own internet connection. I was just using Puppeteer (headless Chromium); I will try making it use my own normal browser instance instead of the bundled one.

Thanks for the tips!
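
For reference, one way the "use my own normal browser instance" idea might look in Python: start regular Chrome with a remote debugging port, then attach to it with Playwright's `connect_over_cdp` instead of launching the bundled browser. This is a rough sketch under assumptions: port 9222, the delay range, and the helper names `chrome_debug_command` and `fetch_pages` are illustrative, and a separate profile directory is assumed because recent Chrome versions may refuse the debugging port on the default profile.

```python
import random
import time
from typing import Dict, List


def chrome_debug_command(chrome_path: str, profile_dir: str, port: int = 9222) -> List[str]:
    """Build the command that starts a regular Chrome with a CDP port open."""
    return [
        chrome_path,
        f"--remote-debugging-port={port}",
        f"--user-data-dir={profile_dir}",
    ]


def fetch_pages(
    urls: List[str],
    port: int = 9222,
    min_delay: float = 2.0,
    max_delay: float = 6.0,
) -> Dict[str, str]:
    """Attach to an already running Chrome and return {url: html}."""
    # Imported lazily so the helper above works without Playwright installed.
    from playwright.sync_api import sync_playwright

    pages: Dict[str, str] = {}
    with sync_playwright() as p:
        # Attach to the running browser rather than launching a new one.
        browser = p.chromium.connect_over_cdp(f"http://127.0.0.1:{port}")
        # Reuse the existing context so the profile's cookies apply.
        context = browser.contexts[0] if browser.contexts else browser.new_context()
        page = context.new_page()
        for url in urls:
            page.goto(url, wait_until="domcontentloaded")
            pages[url] = page.content()
            # Random pause between pages so the pattern is less regular.
            time.sleep(random.uniform(min_delay, max_delay))
        page.close()
    return pages
```

Start Chrome first with the list from `chrome_debug_command(...)` (for example via `subprocess.Popen`), then call `fetch_pages(urls)`. For under 1000 URLs once a month, the random pauses cost little time.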