Show HN: Curl modified to impersonate Firefox and mimic its TLS handshake

3 years ago (github.com)

I run a MITM proxy for adblocking/general filtering, and lately I've noticed Cloudflare and other "bot protection" tends to block me from more and more of the sites I come across in search results, so this will be very useful for fixing that.

However, I should caution that in this era of companies being particularly user-hostile and authoritarian, especially Big Tech, I would be more careful about sharing stuff like this. Being forced to run JS is bad enough; profiling users based on other traits, and essentially determining whether they are using "approved" software, is a dystopia we should fight strongly against. Stallman's Right To Read comes to mind as a very relevant cautionary tale.

  • Cloudflare is likely one of the worst things that has happened to the internet in recent history.

    Like, I get the need for some protective mechanisms for interactive content/posting/etc, but there should be zero cases where a simple GET returning HTTP 200 requires JavaScript/client-side crap. If they serve me a slightly stale version of the remote resource (5 minutes or whatnot), that's fine.

    They've effectively just turned into a Google protection racket. Small/special-purpose search/archive tools are just stonewalled.

    • You can't turn it off as a Cloudflare customer either.

      The best you've got is "essentially off": even with everything disabled, there are still edge cases where their security will enforce a JS challenge or CAPTCHA.

    • Not to be too dismissive of this, but for companies just trying to run a service while getting constantly bombarded by things like DDoS attacks, Cloudflare and its ilk let them serve a large portion of "legitimate" users, compared to none.

      I don't really know how you resolve that absent just like... putting everything behind logins, though.

    • > If they serve me a slightly stale version of the remote resource (5 minutes/whatnot) that's fine.

      Not all sites are configured to do this. Some pages are expensive to render and have no cache layer.

  • I've noticed even GitHub has a login wall now for comments on open-source projects. They truncate them if you aren't logged in, similar to Reddit on mobile, Instagram, Twitter, etc. Hopefully the mobile version doesn't start pushing you to install some crappy app where you can't use features like tabbed browsing, tab sync with another machine, etc.

  • > profiling users based on other traits, and essentially determining if they are using "approved" software, is a dystopia we should fight strongly against. Stallman's Right To Read comes to mind as a very relevant warning story.

    Right to Read indeed... fanfiction.net has become really annoying over the last few months. Especially at night, when you have the FFN UI set to dark and then, out of nowhere, a bright white Cloudflare page appears. Or when the Cloudflare "anti-bot" protection leads to an endless loop because the browser is the Android WebView inside a third-party Reddit client.

  • Maybe I'm just a techno-optimist, but I suspect big tech companies don't give a hoot about you running "unapproved" software; rather, they care about their services being abused, and "unapproved" software is just a useful signal that only misfires on a tiny percentage of legitimate users.

    • You are a lot more charitable than I am. I believe the big tech companies use dark patterns to get us to sign up, improve their metrics and hoover up our data.

    • Just trying to keep services operational is a fine goal to pursue as an operator, but forcing users into narrow inbound funnels for the service is detrimental too. There needs to be more research into letting simpler modes of operation keep working.

      A browser is becoming a universal agent by itself, but many people (maybe increasingly many) use the terminal to access these resources, and stonewalling those paths is never OK in my book.

>impersonate Firefox 95

You should really be impersonating an ESR version (e.g. 91). Versions from the release channel are updated every month or so, and everyone has auto-update enabled, so unless you keep it up to date, your fingerprint is going to stick out like a sore thumb in a few months. ESR, on the other hand, sticks to one version and shouldn't change significantly during its one-year lifetime. It's still going to stick out to some extent (most people don't use ESR), but at least there are some enterprises using ESR for you to blend in with.

  • They should really be impersonating Chrome. If this takes off, Firefox has such a small user share that I could see sites just banning Firefox altogether, like they do with Tor.

    • I suspect Tor is being banned not because of its small user share.

      Perhaps you'll get broken sites with Firefox because no one cared. But banning? Seems like a stretch.

  • Thanks for the suggestion, I had no idea ESR was a thing. I've just added support for Firefox ESR 91 (it was pretty similar and required adding one cipher to the cipher list and changing the user agent).
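
    Roughly speaking, the same two knobs are reachable from libcurl as well. Here's a pycurl sketch of the idea; to be clear, this isn't the actual patch, and the cipher names below are illustrative placeholders rather than the real ESR 91 list:

      import pycurl
      from io import BytesIO

      buf = BytesIO()
      c = pycurl.Curl()
      c.setopt(pycurl.URL, "https://example.com")
      # Knob 1: the User-Agent header (Firefox ESR 91 format).
      c.setopt(pycurl.USERAGENT,
               "Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101 Firefox/91.0")
      # Knob 2: the cipher list offered in the ClientHello. Names are
      # backend-dependent; these are OpenSSL-style placeholders.
      c.setopt(pycurl.SSL_CIPHER_LIST,
               "ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256")
      c.setopt(pycurl.WRITEDATA, buf)
      c.perform()
      c.close()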

  • I think ESR is the way to go too, but either way, I wonder if some tests could be written to confirm the coverage/similarity of the requests? It would entail automating both a Firefox session and the recording of network traffic, and it feels like it might end up as bikeshedding.

Cool, can't wait for anti-bot protection to start rejecting me because I use Firefox.

  • Only a matter of time I'm afraid :( Firefox usage share is already low enough for many sites to make pages for Chrome and maybe Safari only.

Given the relative market shares it might make more sense to impersonate Chrome.

  • I will try to impersonate Chrome next. However, I suspect this is going to be more challenging: Chrome uses BoringSSL, which curl does not support, so it means either forcing curl to compile against BoringSSL or modifying NSS to look like BoringSSL.

  • And make it seem like Firefox has less market share? Sounds like a good way to kill Firefox even faster, my 2 cents.

    • The counter-argument is service providers just choosing to block anything that looks like Firefox, since the market share is so small and it's being used to circumvent their precious protections.

"Some web services therefore use the TLS handshake to fingerprint which HTTP client is accessing them. Notably, some bot protection platforms use this to identify curl and block it."

As a user of non-browser clients (not curl though) I have not run into this in the wild.^1

Anyone have an example of a site that blocks non-browser clients based on TLS fingerprint?

1. As far as I know. The only site I know of today that is blocking non-browser clients appears to be www.startpage.com. Perhaps this is the heuristic they are using. More likely it is something simpler I have not figured out yet.
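
For reference, the kind of fingerprint the README describes is typically a JA3-style hash: a handful of ClientHello fields joined into a string and hashed. A minimal sketch of the mechanism; the field values below are made-up placeholders, not a real Firefox or curl handshake:

    import hashlib

    def ja3_hash(version, ciphers, extensions, curves, point_formats):
        # JA3 joins each field's decimal values with '-' and the five
        # fields with ',' before MD5-hashing the resulting string.
        fields = [
            str(version),
            "-".join(map(str, ciphers)),
            "-".join(map(str, extensions)),
            "-".join(map(str, curves)),
            "-".join(map(str, point_formats)),
        ]
        return hashlib.md5(",".join(fields).encode()).hexdigest()

    # Placeholder numbers, for illustration only:
    print(ja3_hash(771, [4865, 4866, 4867], [0, 23, 65281], [29, 23, 24], [0]))

Because curl and a browser offer different ciphers and extensions, the hashes differ, which is exactly what the impersonation patch papers over.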

Do you plan on getting this merged back into curl with an option to enable it? I can see that being useful for some people.

Currently, I cannot think of anything other than "noscript/basic (x)html"/IRC to get us out of this, at least for sites where such protocols are "good enough" to provide their services to users over the internet. But how? Enlighten the "javascript web" brainwashed devs and make them realize how toxic what they do is? Regulations (at least for critical sites)? And how do we deal with the other sites, the ones whose devs are scammers, perfectly aware of how toxic they are, and keep doing it anyway?

In my own country, for critical sites, I will probably have to go to court, since "noscript/basic (x)html" interop was broken in the last few years.

Would be cool if there was something like this for Python. Last time I tried to scrape something interesting, I found that one of Cloudflare's enterprise options was easily blocking all of the main HTTP libraries due to the identifiable TLS handshake.
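
One partial workaround is handing requests a custom SSLContext, which at least changes the cipher portion of the fingerprint; Python's ssl module gives no control over extension order, though, so it won't fully match a browser. A rough sketch, with an illustrative placeholder cipher string:

    import ssl

    import requests
    from requests.adapters import HTTPAdapter

    class CustomTLSAdapter(HTTPAdapter):
        def init_poolmanager(self, *args, **kwargs):
            ctx = ssl.create_default_context()
            # Placeholder cipher string, not a verified browser match.
            ctx.set_ciphers(
                "ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256")
            kwargs["ssl_context"] = ctx  # passed through to urllib3's PoolManager
            return super().init_poolmanager(*args, **kwargs)

    session = requests.Session()
    session.mount("https://", CustomTLSAdapter())
    print(session.get("https://example.com").status_code)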

  • Are you sure they blocked you because of the handshake?

    I always thought it was the myriad of cookies, and the expiry times of said cookies, that tend to make non-browser clients more obvious to CF.

    • The site wasn't using it to block me, just to prompt a CAPTCHA, without doing so to "real" browsers.

      The HTTP requests were exact copies of browser requests (in terms of how the server would've seen them), so it was something below HTTP. I ended up finding a lot of info about Cloudflare and the TLS stuff on StackOverflow, with others having similar issues. Someone even made an API to do the TLS stuff as a service, but it was too expensive for me. https://pixeljets.com/blog/scrape-ninja-bypassing-cloudflare...

  • I think most of the scraping libraries have stagnated, since it's hard to scrape without a headless browser these days... too many sites with client-side-rendered content.

Very cool! Thanks for sharing - it’s always nice to learn about fingerprinting tricks and workarounds, from both a privacy and a “don’t unintentionally look like a bot” perspective.

What inspired the project?

  • Motivation is in the blog post: https://lwthiker.com/reversing/2022/02/17/curl-impersonate-f...

    • Good blog post. Stuff like this makes me wonder if by 2030 (1) the internet will mostly consist of machine generated content; (2) machines written by normal people in Python won't be authorized to access the machine-generated content anymore due to Protectify; (3) most client traffic will originate from Protectify's network, so people like bloggers won't have any visibility into whether their readers are humans or machines; (4) video compression algorithms will become indistinguishable from deepfakes; and (5) airborne pathogens will make alternatives to the above impractical.

This is cool, but is it really needed that often?

There are some industries (virtually all of Wall Street, for example, and certain parts of government) where the company needs to surveil 100% of what their employees do on the web from inside the office. These companies have been running MITM proxies for decades.

Wouldn't any website that rejects a non-browsery TLS client be blocking out these people as well?

  • They don't block you completely, just present you with a JS challenge that delays your access to the site. A browser, even if behind a MITM proxy, would be able to solve this challenge.

Very cool. I would have used Puppeteer/Playwright in a similar scenario, but thanks for sharing the bot detection trick they employ.
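
For the record, the headless-browser route looks something like this with Playwright's Python API (example.com standing in for the real target); you get a genuine Firefox TLS handshake, at the cost of far more overhead than curl:

    from playwright.sync_api import sync_playwright

    # Launch a real (headless) Firefox, so the TLS handshake is genuine.
    with sync_playwright() as p:
        browser = p.firefox.launch(headless=True)
        page = browser.new_page()
        page.goto("https://example.com")
        print(page.title())
        browser.close()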

Great work mate, one of my teammates showed me this library and we might use it in the near future.