Comment by sparkling
8 years ago
So just to clarify: some bug makes Cloudflare leak the HTTP Headers into the HTML being served and those HTML pages containing sensitive Info got cached by Google (and others)?
Yes. Think of it this way.
You have a function that strips all colons from your input. For some reason, in certain cases, your code misbehaves: instead of replacing each colon with an empty string, it accidentally replaces the colon with other data sitting in memory. So now all the colons in your input have been replaced with data you shouldn't have touched, and whoever sent you the input gets it back plus data they shouldn't be able to see.
And Google in this case caches those output strings.
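A rough sketch of that colon-stripping analogy, in Python. Everything here is made up for illustration: `MEMORY`, `INPUT_LEN`, the function names, and the idea that the bug copies 6 bytes per colon are all assumptions, and Python itself would never misbehave like this, so the "memory" is simulated with a flat byte buffer.

```python
# Toy model: one flat byte buffer stands in for process memory.
# The caller's input comes first; someone else's data happens to sit right after it.
USER_INPUT = b"user:name"
MEMORY = bytearray(USER_INPUT + b"SECRET-COOKIE=abc123")
INPUT_LEN = len(USER_INPUT)


def strip_colons_correct(memory, input_len):
    """Intended behaviour: drop every colon and touch nothing else."""
    return bytes(memory[:input_len]).replace(b":", b"")


def strip_colons_buggy(memory, input_len, chunk=6):
    """Hypothetical bug: each colon is replaced with bytes from beyond the input."""
    out = bytearray()
    leak_pos = input_len  # assumption: the stray read starts just past the input
    for i in range(input_len):
        if memory[i] == ord(":"):
            # Should emit nothing here; instead copies unrelated memory.
            out += memory[leak_pos:leak_pos + chunk]
            leak_pos += chunk
        else:
            out.append(memory[i])
    return bytes(out)


print(strip_colons_correct(MEMORY, INPUT_LEN))  # b'username'
print(strip_colons_buggy(MEMORY, INPUT_LEN))    # b'userSECRETname'
```

The caller only ever sent `user:name`, but the buggy output carries bytes that belonged to someone else, and anything that stores that output (a cache, a crawler) now stores those bytes too.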
@homero (since I can't nest a reply any further), it's not the contents of the crawler's request that gets randomly injected into the page that the crawler requests, but rather the contents of other requests to the same Cloudflare server.
Imagine I'm having a chat on some website X, which uses Cloudflare. Cloudflare acts as a man in the middle, meaning my request, and the response, likely pass through its memory at some point to allow me to communicate with X.
Later, a Google bot comes along and requests a page from site Y. Because of this bug, random bits of memory that were left around on the Cloudflare server get inserted into the response to the bot's request. Those bits of memory could be from anything that's gone through that server in the past, including my conversations on website X. The bot then assumes that the content that Cloudflare spits out for website Y is an accurate representation of website Y's contents, and it caches those contents. In this way, my data from website X ends up in Google's cached version of website Y.
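To make the site X / site Y sequence concrete, here is a minimal simulation. `ToyProxy`, `relay`, the `overread` parameter and the `site-y.example` key are hypothetical names invented for this sketch; it models a reused buffer that is never cleared, not Cloudflare's actual code.

```python
class ToyProxy:
    """Stand-in for a shared proxy process with one reusable buffer."""

    def __init__(self, size=64):
        self.buf = bytearray(size)  # reused across requests, never wiped

    def relay(self, payload, overread=0):
        # Copy the payload into the shared buffer, overwriting only its own length.
        self.buf[:len(payload)] = payload
        # A correct relay returns exactly len(payload) bytes.
        # overread > 0 simulates the bug: bytes past the payload are returned too.
        end = min(len(payload) + overread, len(self.buf))
        return bytes(self.buf[:end])


proxy = ToyProxy()

# 1. My chat on site X passes through the proxy and lingers in its buffer.
proxy.relay(b"site-X chat: my password is hunter2")

# 2. A crawler fetches a page from site Y and the bug fires.
page_y = proxy.relay(b"<p>site Y", overread=40)

# 3. The crawler trusts the response and caches it under site Y's URL.
crawler_cache = {"site-y.example/": page_y}

print(b"hunter2" in crawler_cache["site-y.example/"])  # True
```

Nothing from the crawler's own request is needed for the leak; the leftover bytes from the earlier, unrelated request are what end up in the cached copy of site Y.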
But how is Google getting headers from the users of the sites? The headers it sees should be from its own crawler.
If I (user A) access upwork.com (I just saw this on the list of affected websites, so it's not meant to be an ad), I am sending them my headers. Let's say my headers and other data are saved in M1 (memory register 1).
Then Google accesses the website as the crawler (user B), and its headers and data are saved in M2. However, Google triggered a bug and now has access to M1 as well. So now Google sees its own headers + my data + other garbage.
Imagine this: Google sends a request to get data from malformedhtml.com for crawling purposes. This site's HTML happens to have that weird incomplete tag problem they mentioned. This site is served by Cloudflare, wherein a buggy script manages to insert some data from the server's memory into the HTML that it returns to Google. Now this data in the memory contains HTTP request headers etc. of _completely unrelated websites_ that are also behind CF.
Google gets this HTML and caches it and that's how it ends up there.
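One way an incomplete tag could push a scanner past the end of a page is an end-of-buffer test that uses equality instead of a range check. The sketch below is a toy loosely inspired by that idea, not Cloudflare's parser: `copy_page`, the `strict` flag and the sample bytes are all assumptions made for illustration.

```python
def copy_page(memory, page_len, strict):
    """Copy a page out of `memory`, consuming '<' plus the next byte in one step."""
    out = bytearray()
    pos = 0
    while pos < len(memory):
        # strict: stop once pos reaches OR passes the end of the page.
        # buggy:  stop only if pos lands exactly on the end of the page.
        at_end = (pos >= page_len) if strict else (pos == page_len)
        if at_end:
            break
        if memory[pos] == ord("<"):
            stop = min(pos + 2, page_len) if strict else pos + 2
            out += memory[pos:stop]
            pos += 2  # a trailing '<' makes pos jump from page_len - 1 to page_len + 1
        else:
            out.append(memory[pos])
            pos += 1
    return bytes(out)


page = b"<b>hi<"                       # malformed: ends in an unfinished tag
other_traffic = b"Cookie: session=42"  # unrelated data sitting next to it
memory = page + other_traffic

print(copy_page(memory, len(page), strict=True))   # b'<b>hi<'
print(copy_page(memory, len(page), strict=False))  # b'<b>hi<Cookie: session=42'
```

With well-formed input the two variants behave the same, which is why a bug of this shape can sit unnoticed until a page with just the wrong ending comes along.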
Yeah.
"We leaked information from Customer A to Customer B by accident" is the first order problem.
But the existence of web caches means that all that private information of customer A is potentially fucking everywhere now.
How do you even clean this up? How do you even start?
They leak uninitialized memory contents into the HTML being served; that memory could (and did) contain data from any other traffic that passed through their hands.
So a request sent to Cloudflare customer A's site could return data from Cloudflare customer B, including data that B thought was only being served via https to authenticated users of B.
Not just headers, basically random memory dumps that could contain anything that Cloudflare saw (which is almost everything). Passwords, certificates, you name it.
Essentially. Any headers from any site routing through Cloudflare could get injected into the body of a second site's page if that second site was using the obfuscation feature. Those "mis-stuffed" pages could be (and were) then cached by, among other things, crawlers like those operated by Google and Bing.
Apparently 7xx sites had this enabled, but that affected 4000ish other sites that happened to be on the same infrastructure.
Near as I can tell, the HTTP Headers from one site are being included in HTML of other sites...