Comment by homero

8 years ago

But how is Google getting headers from users of the sites? The only headers it sees should be its own crawler's.

If I (user A) access upwork.com (I just saw this on the list of affected websites, so it's not meant to be an ad), I am sending them my headers. Let's say my headers and other data are saved in M1 (memory register 1).

Then Google accesses the website as the crawler (user B), and its headers and data are saved in M2. However, Google's request triggered a bug, so it now has access to M1 as well. So now Google sees its own headers + my data + other garbage.

Imagine this: Google sends a request to get data from malformedhtml.com for crawling purposes. This site's HTML happens to have that weird incomplete tag problem they mentioned. The site is served through Cloudflare, where a buggy script ends up inserting some data from the server's memory into the HTML that it returns to Google. That memory contains HTTP request headers etc. of _completely unrelated websites_ that are also behind CF.

Google gets this HTML and caches it, and that's how it ends up there.
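
If it helps, here is a toy Python sketch of that kind of over-read. It is not Cloudflare's code. `MEMORY`, `store` and `rewrite_html` are made-up names, and I'm assuming the bug boils down to an end-of-buffer check that tests for equality, so a parser that skips one byte too far on an unterminated tag never notices it has left the page.

```python
# Toy model only: one bytearray stands in for the proxy's process memory,
# shared by every site behind it. All names here are hypothetical.
MEMORY = bytearray()


def store(data: bytes) -> tuple:
    """Append data to the shared memory and return its (start, end) offsets."""
    start = len(MEMORY)
    MEMORY.extend(data)
    return start, len(MEMORY)


def rewrite_html(start: int, end: int) -> bytes:
    """Copy a page out of memory, with a deliberately buggy end check."""
    out = bytearray()
    p = start
    while p != end:               # BUG: equality test; `p < end` would be safe
        if p >= len(MEMORY):      # guard so the toy stops at the end of memory
            break
        ch = MEMORY[p]
        out.append(ch)
        # An unterminated '<' as the last byte makes the parser look one byte
        # ahead, which jumps p over `end` so the equality test never fires.
        if ch == ord("<") and p + 1 == end:
            p += 2
        else:
            p += 1
    return bytes(out)


# The malformed page (what the crawler asked for) sits next to another
# visitor's request data from an unrelated site.
page = store(b"<p>hello<")
store(b"Cookie: session=abc123")

leaked = rewrite_html(*page)      # page bytes plus the neighbouring data

# A well-formed page is copied correctly, which is why most requests looked fine.
good = store(b"<p>ok</p>")
clean = rewrite_html(*good)
```

Here `leaked` starts with the page itself and then runs on into the other visitor's cookie, while `clean` is just the page. Whoever made the request, crawler or not, gets whatever happened to be sitting next to the page in memory.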