
Comment by tootahe45

3 years ago

Would be cool if there was something like this for Python. Last time I tried to scrape something interesting, I found that one of Cloudflare's enterprise options was easily blocking all of the main HTTP libraries due to the identifiable TLS handshake.

Are you sure they blocked you because of the handshake?

I always thought it was the myriad cookies, and the expiry times of said cookies, that tend to make non-browser clients more obvious to CF.
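
One way to look at the handshake side of this locally, as a rough sketch using only Python's standard `ssl` module: the cipher suites a client context has enabled come from the Python/OpenSSL build and its configuration, not from anything at the HTTP layer, so copying a browser's headers leaves them unchanged. The helper name `offered_cipher_names` is made up for illustration, and whether Cloudflare actually keys on this list is an assumption that nothing in the thread confirms.

```python
import ssl


def offered_cipher_names(ctx=None):
    """Return the names of the cipher suites enabled on an SSL context.

    With no argument, this inspects the context that
    ssl.create_default_context() builds, which is roughly what the common
    Python HTTP libraries start from (an assumption; each library may
    adjust it).
    """
    if ctx is None:
        ctx = ssl.create_default_context()
    # get_ciphers() returns one dict per enabled suite; "name" is the
    # OpenSSL name of the suite.
    return [cipher["name"] for cipher in ctx.get_ciphers()]
```

Printing `offered_cipher_names()` and comparing it with what a browser offers (for example, from a packet capture) gives a feel for how different the two handshakes can look before any HTTP request is sent.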

  • The site wasn't using it to block me, just to prompt a captcha, without doing so to 'real' browsers.

    The HTTP requests were exact copies of browser requests (in terms of how the server would've seen them), so it was something below HTTP. I ended up finding a lot of info about Cloudflare and the TLS stuff on Stack Overflow, with others having similar issues. Someone even made an API to do the TLS stuff as a service, but it was too expensive for me. https://pixeljets.com/blog/scrape-ninja-bypassing-cloudflare...

    • Thanks for the response, I never came across that particular behaviour.

      FWIW, I think that with 'copy as cURL' the HTTP header ordering may be different, and it's worth loading a page twice, as some of the cookies are replaced.

      I've used Puppeteer, as the article talks about. It manages the cookies better. I managed to do continuous requests without getting further CF blocks, as opposed to only a couple of hundred with cURL (due to the cookies differing from what CF expects over time).

      IIRC CF does have a sliding scale of how protected you want a site to be, so perhaps the TLS stuff belongs further up the scale.
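
The cookie point above (load the page twice, let replaced cookies carry over) can be approximated without a headless browser. This is a minimal sketch using the standard library's `http.cookiejar` and `urllib.request`, not Puppeteer; the names `make_cookie_session` and `fetch_twice` are hypothetical, and it does nothing about the TLS handshake, so it would only help where cookies are the deciding signal.

```python
import http.cookiejar
import urllib.request


def make_cookie_session(user_agent):
    """Build an opener that stores and resends cookies across requests."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar)
    )
    # Replace urllib's default User-Agent with the caller's value.
    opener.addheaders = [("User-Agent", user_agent)]
    return opener, jar


def fetch_twice(opener, url, timeout=10.0):
    """Request the same URL twice so cookies set or replaced by the first
    response are sent back on the second. Returns both status codes.

    Note: urllib raises urllib.error.HTTPError for 4xx/5xx responses, so
    a challenge page served with an error status surfaces as an exception.
    """
    statuses = []
    for _ in range(2):
        with opener.open(url, timeout=timeout) as resp:
            resp.read()
            statuses.append(resp.status)
    return statuses
```

Usage would look like `opener, jar = make_cookie_session("my-agent/0.1")` followed by `fetch_twice(opener, url)`; inspecting `jar` afterwards shows which cookies the site set or rotated between the two loads.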

I think most of the scraping libraries have stagnated since it's hard to scrape without a headless browser these days...too many sites with client-side rendered content.