Show HN: Curl modified to impersonate Firefox and mimic its TLS handshake

3 years ago (github.com)

I run a MITM proxy for adblocking/general filtering, and lately I've noticed Cloudflare and other "bot protection" tends to block me from more and more of the sites I come across in search results, so this will be very useful for fixing that.

However, I should caution that in this era of companies being particularly user-hostile and authoritarian, especially Big Tech, I would be more careful about sharing stuff like this. Being forced to run JS is bad enough; profiling users based on other traits, and essentially determining whether they are using "approved" software, is a dystopia we should fight strongly against. Stallman's Right To Read comes to mind as a very relevant cautionary tale.

  • Cloudflare is likely one of the worst things that has happened to the internet in recent history.

    Like, I get the need for some protective mechanisms for interactive content/posting/etc, but there should be zero cases where a simple GET returning HTTP 200 requires JavaScript/client-side crap. If they serve me a slightly stale version of the remote resource (5 minutes or whatnot), that's fine.

    They've effectively just turned into a Google protection racket. Small/special-purpose search/archive tools are just stonewalled.

    • You can't turn it off as a Cloudflare customer either.

      The best you've got is "essentially off": even with everything disabled, there are still edge cases where their security will enforce a JS challenge or CAPTCHA.

    • Not to be too dismissive of this, but for companies just trying to run a service while getting constantly bombarded by things like DDoS attacks, Cloudflare and its ilk let them serve a large portion of "legitimate" users, compared to none.

      I don't really know how you resolve that absent just like... putting everything behind logins, though.

    • > If they serve me a slightly stale version of the remote resource (5 minutes/whatnot) that's fine.

      Not all sites are configured to do this. Some pages are expensive to render and have no cache layer.

  • I've noticed even GitHub has a login wall now for comments on open-source projects. They truncate them if you aren't logged in, similar to Reddit on mobile, Instagram, Twitter, etc. Hopefully the mobile version doesn't start pushing you to install some crappy app where you can't use features like tabbed browsing, tab sync with another machine, etc.

  • > profiling users based on other traits, and essentially determining if they are using "approved" software, is a dystopia we should fight strongly against. Stallman's Right To Read comes to mind as a very relevant warning story.

    Right to Read indeed... fanfiction.net has become really annoying over the last few months. Especially at night, when you have the FFN UI set to dark and then, out of nowhere, a bright white Cloudflare page appears. Or when the Cloudflare "anti-bot" protection leads to an endless loop because the browser is the Android WebView inside a third-party Reddit client.

  • Maybe I'm just a techno-optimist, but I suspect big tech companies don't give a hoot about you running "unapproved" software; rather, they care about their services being abused, and "unapproved" software is just a useful signal that only misfires on a tiny percentage of legitimate users.

    • You are a lot more charitable than I am. I believe the big tech companies use dark patterns to get us to sign up, improve their metrics and hoover up our data.

    • Just trying to keep services operational is a fine goal to pursue as an operator, but forcing users into narrow inbound funnels for the service is detrimental too. There needs to be more research into letting simpler modes of operation keep working.

      A browser is becoming a universal agent by itself, but many people (maybe increasingly many) use the terminal to access these resources, and stonewalling those paths is never OK in my book.

>impersonate Firefox 95

You should really be impersonating an ESR version (e.g. 91). Versions from the release channel are updated every month or so, and everyone has auto-update enabled, so unless you keep it up to date, your fingerprint is going to stick out like a sore thumb in a few months. ESR, on the other hand, sticks to one version and shouldn't change significantly during its one-year lifetime. It's still going to stick out to some extent (most people don't use ESR), but at least there are some enterprises using ESR for you to blend in with.

  • They should really be impersonating Chrome. If this takes off, Firefox has such a small user share that I could see sites just banning Firefox altogether, like they do with Tor.

    • I suspect Tor is being banned not because of its small user share.

      Perhaps you'll get broken sites with Firefox because no one cared. But banning? Seems like a stretch.

  • Thanks for the suggestion, I had no idea ESR was a thing. I've just added support for Firefox ESR 91 (it was pretty similar and required adding one cipher to the cipher list and changing the user agent).
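
    Roughly speaking, the same two knobs are reachable from libcurl as well. Here's a pycurl sketch of the idea; to be clear, this isn't the actual patch, and the cipher names below are illustrative placeholders rather than the real ESR 91 list:

      import pycurl
      from io import BytesIO

      buf = BytesIO()
      c = pycurl.Curl()
      c.setopt(pycurl.URL, "https://example.com")
      # Knob 1: the User-Agent header (Firefox ESR 91 format).
      c.setopt(pycurl.USERAGENT,
               "Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101 Firefox/91.0")
      # Knob 2: the cipher list offered in the ClientHello. Names are
      # backend-dependent; these are OpenSSL-style placeholders.
      c.setopt(pycurl.SSL_CIPHER_LIST,
               "ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256")
      c.setopt(pycurl.WRITEDATA, buf)
      c.perform()
      c.close()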

  • I think ESR is the way to go too, but either way, I wonder if some tests could be written to confirm the coverage/similarity of the requests? It would entail automating both a Firefox session and the recording of network traffic, and it feels like it might end up as bikeshedding.

Cool, can't wait for anti-bot protection to start rejecting me because I use Firefox.

  • Only a matter of time I'm afraid :( Firefox usage share is already low enough for many sites to make pages for Chrome and maybe Safari only.

Given the relative market shares it might make more sense to impersonate Chrome.

  • I will try to impersonate Chrome next. However, I suspect this is going to be more challenging: Chrome uses BoringSSL, which curl does not support, so it means either forcing curl to compile against BoringSSL or modifying NSS to look like BoringSSL.

  • And make it seem like Firefox has less market share? Sounds like a good way to kill Firefox even faster, my 2 cents.

    • The counter-argument is service providers just choosing to block anything that looks like Firefox, since the market share is so small and it's being used to circumvent their precious protections.

"Some web services therefore use the TLS handshake to fingerprint which HTTP client is accessing them. Notably, some bot protection platforms use this to identify curl and block it."

As a user of non-browser clients (not curl though) I have not run into this in the wild.^1

Anyone have an example of a site that blocks non-browser clients based on TLS fingerprint?

1. As far as I know. The only site I know of today that is blocking non-browser clients appears to be www.startpage.com. Perhaps this is the heuristic they are using. More likely it is something simpler I have not figured out yet.
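
For reference, the kind of fingerprint the README describes is typically a JA3-style hash: a handful of ClientHello fields joined into a string and hashed. A minimal sketch of the mechanism; the field values below are made-up placeholders, not a real Firefox or curl handshake:

    import hashlib

    def ja3_hash(version, ciphers, extensions, curves, point_formats):
        # JA3 joins each field's decimal values with '-' and the five
        # fields with ',' before MD5-hashing the resulting string.
        fields = [
            str(version),
            "-".join(map(str, ciphers)),
            "-".join(map(str, extensions)),
            "-".join(map(str, curves)),
            "-".join(map(str, point_formats)),
        ]
        return hashlib.md5(",".join(fields).encode()).hexdigest()

    # Placeholder numbers, for illustration only:
    print(ja3_hash(771, [4865, 4866, 4867], [0, 23, 65281], [29, 23, 24], [0]))

Because curl and a browser offer different ciphers and extensions, the hashes differ, which is exactly what the impersonation patch papers over.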

Do you plan on getting this merged back into curl with an option to enable it? I can see that being useful for some people.

Currently, I cannot think of anything other than "noscript/basic (x)html"/IRC to get us out of this, at least for sites where such protocols are "good enough" to provide their services to users over the internet. But how? Enlighten the "javascript web" brainwashed devs and make them realize how toxic what they do is? Regulations (at least for critical sites)? And how do we deal with the other sites, the ones whose devs are scammers, perfectly aware of how toxic they are, and keep doing it anyway?

In my own country, for critical sites, I will probably have to go to court, since "noscript/basic (x)html" interop was broken in the last few years.

Would be cool if there was something like this for Python. Last time I tried to scrape something interesting, I found that one of Cloudflare's enterprise options was easily blocking all of the main HTTP libraries due to the identifiable TLS handshake.
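
One partial workaround is handing requests a custom SSLContext, which at least changes the cipher portion of the fingerprint; Python's ssl module gives no control over extension order, though, so it won't fully match a browser. A rough sketch, with an illustrative placeholder cipher string:

    import ssl

    import requests
    from requests.adapters import HTTPAdapter

    class CustomTLSAdapter(HTTPAdapter):
        def init_poolmanager(self, *args, **kwargs):
            ctx = ssl.create_default_context()
            # Placeholder cipher string, not a verified browser match.
            ctx.set_ciphers(
                "ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256")
            kwargs["ssl_context"] = ctx  # passed through to urllib3's PoolManager
            return super().init_poolmanager(*args, **kwargs)

    session = requests.Session()
    session.mount("https://", CustomTLSAdapter())
    print(session.get("https://example.com").status_code)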

  • Are you sure they blocked you because of the handshake?

    I always thought it was the myriad of cookies, and the expiry times of said cookies, that tend to make non-browser clients more obvious to CF.

    • The site wasn't using it to block me, just to prompt a CAPTCHA, without doing so to "real" browsers.

      The HTTP requests were exact copies of browser requests (in terms of how the server would've seen them), so it was something below HTTP. I ended up finding a lot of info about Cloudflare and the TLS stuff on StackOverflow, with others having similar issues. Someone even made an API to do the TLS stuff as a service, but it was too expensive for me. https://pixeljets.com/blog/scrape-ninja-bypassing-cloudflare...

  • I think most of the scraping libraries have stagnated, since it's hard to scrape without a headless browser these days... too many sites with client-side-rendered content.

Very cool! Thanks for sharing - it’s always nice to learn about fingerprinting tricks and workarounds, from both a privacy and a “don’t unintentionally look like a bot” perspective.

What inspired the project?

  • Motivation is in the blog post: https://lwthiker.com/reversing/2022/02/17/curl-impersonate-f...

    • Good blog post. Stuff like this makes me wonder if by 2030 (1) the internet will mostly consist of machine generated content; (2) machines written by normal people in Python won't be authorized to access the machine-generated content anymore due to Protectify; (3) most client traffic will originate from Protectify's network, so people like bloggers won't have any visibility into whether their readers are humans or machines; (4) video compression algorithms will become indistinguishable from deepfakes; and (5) airborne pathogens will make alternatives to the above impractical.

This is cool, but is it really needed that often?

There are some industries (virtually all of Wall Street, for example, and certain parts of government) where the company needs to surveil 100% of what their employees do on the web from inside the office. These companies have been running MITM proxies for decades.

Wouldn't any website that rejects a non-browsery TLS client be blocking out these people as well?

  • They don't block you completely, just present you with a JS challenge that delays your access to the site. A browser, even if behind a MITM proxy, would be able to solve this challenge.

Very cool. I would have used Puppeteer/Playwright in a similar scenario, but thanks for sharing the bot detection trick they employ.
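
For the record, the headless-browser route looks something like this with Playwright's Python API (example.com standing in for the real target); you get a genuine Firefox TLS handshake, at the cost of far more overhead than curl:

    from playwright.sync_api import sync_playwright

    # Launch a real (headless) Firefox, so the TLS handshake is genuine.
    with sync_playwright() as p:
        browser = p.firefox.launch(headless=True)
        page = browser.new_page()
        page.goto("https://example.com")
        print(page.title())
        browser.close()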

Great work mate, one of my teammates showed me this library and we might use it in the near future.