Comment by joshuamorton

6 years ago

Interesting, TIL. That doesn't change the major point I was making, though, which is that an anonymized identifier (such as the 13-bit ID under discussion) isn't personal info, even if it might have originally been collected alongside data which is personal info. If I give you said 13-bit ID, you need other info to back out a single person; the anonymous ID corresponds to multiple IPs.
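For illustration (assuming, unrealistically, a uniform spread over the whole IPv4 space), a quick pigeonhole calculation shows how many addresses would share each 13-bit value on average:

```python
# Pigeonhole argument: a 13-bit ID cannot single out one IP address.
# Illustrative only; real Chrome installs are not uniform over IPv4 space.
total_ipv4 = 2 ** 32   # size of the IPv4 address space
buckets = 2 ** 13      # 8192 possible 13-bit IDs

ips_per_bucket = total_ipv4 // buckets
print(ips_per_bucket)  # 524288 addresses map to each ID on average
```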

I think you're still missing the point. Google transmits personal data to their servers without user consent. The value of x-client-data is personal data because it is associated with an IP address during transit, due to how HTTP requests work. The nature of the data, what is being done with it on the server, and the location of the server are all irrelevant in this instance; the only important part is that personal data has left the browser in the form of a request and reached a Google server.

This data collection would only be exempt from GDPR consent requirements if the data were required for the service to function, but that is not the case with x-client-data.

  • > The value of x-client-data is personal data, because it is associated with an IP address during transit, due to how HTTP requests work.

    This is not correct. The x-client-data header on its own is not personal data; x-client-data associated with an IP address is personal data. As soon as you separate the client data from the IP, the client data stops being personal data. IOW, the tuple (x-client-data, IP) is personal data, but x-client-data on its own isn't, because it cannot be used to infer the IP.

    I don't know where you're getting this idea that "if two pieces of data ever touch and one of them is personal data, the other one is now also contaminated as personal data". It's not true. That would make the existence of anonymous data (which the GDPR specifies as a thing) practically impossible on the web, since all requests are associated with an IP on receipt. (Or actually even worse: it would make the process of anonymizing data impossible in general, since the anonymization process associates the anonymized data with the original personal data.)

    To be precise, the GDPR defines anonymized data as "data rendered anonymous in such a way that the data subject is not or no longer identifiable". The x-client-data header is exactly that: the subject of the header is not identifiable by the x-client-data header alone. Therefore the header is anonymous and not subject to the strong GDPR requirements.

    For the client data header to be personal data, you'd need to describe a scheme such that, given an x-client-data header, and only an x-client-data header, you could identify one (and only one) unique person to whom that header corresponds. You're welcome to come up with such a scheme, but my intro CS classes taught me that bucketed hashing is irreversible, and with 8192 buckets, you're not going to be able to uniquely identify anyone specific.
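    A minimal sketch of such bucketed hashing (the hash function and inputs here are made up, not Chrome's actual scheme): because many distinct inputs collide into each of the 8192 buckets, a bucket value cannot be inverted back to a unique input.

```python
import hashlib

BUCKETS = 8192  # 2**13 possible values

def bucket_id(identifier: str) -> int:
    # Hash an arbitrary identifier down to one of 8192 buckets.
    digest = hashlib.sha256(identifier.encode()).digest()
    return int.from_bytes(digest[:4], "big") % BUCKETS

# With far more inputs than buckets, collisions are guaranteed
# (pigeonhole), so a bucket value maps back to many candidates.
ids = [f"user-{i}" for i in range(100_000)]
collisions = {}
for i in ids:
    collisions.setdefault(bucket_id(i), []).append(i)

print(max(len(v) for v in collisions.values()) > 1)  # True
```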

    • The Chrome whitepaper is written in a way that makes you believe there are only ~8000 possibilities.

      But read what they say carefully: there are only ~8000 possibilities if the crash reporting functionality is disabled (which is not the default).

      Otherwise the marker is a huge differentiator (I haven't seen any duplicates personally).

    • > That would make the existence of anonymous data practically speaking impossible to have on the web

      For almost every type of data that is true. Transforming or substituting data doesn't make it anonymous; the patterns in the data are still present. To produce actually anonymous data you have to do what the GDPR instructs: corrupt the data ("rendered anonymous") severely enough that the "data subject is ... no longer identifiable". You need to do something like aggregate the data into a small number of groups such that individual records no longer exist. Techniques like "differential privacy" let you control precisely how "anonymous" your data is by e.g. mixing in carefully crafted noise.
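      As a sketch, here is the classic Laplace mechanism for a counting query (the function name is mine; ε is the privacy budget):

```python
import math
import random

def dp_count(true_count: int, epsilon: float) -> float:
    # Laplace mechanism: a counting query has sensitivity 1, so adding
    # Laplace noise with scale 1/epsilon gives epsilon-differential privacy.
    # Sample Laplace(0, 1/epsilon) by inverse transform.
    u = random.random() - 0.5
    noise = -(1.0 / epsilon) * math.copysign(1.0, u) * math.log(1 - 2 * abs(u))
    return true_count + noise

# Smaller epsilon -> more noise -> stronger privacy guarantee.
print(dp_count(1000, epsilon=0.5))  # prints a noisy count near 1000
```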

      > 8192 buckets

      While others have pointed out that this isn't actually limited to 13 bits of entropy for most people, there are at least two reasons that field is still very personally identifying. First, "x-client-data on its own" never happens. Google isn't wasting time and money implementing this feature to make an isolated database with a single column. At no point will the x-client-data value (or any other type of data they capture) ever sit in isolation. I used the IPv4 Source Address as an example because it will necessarily be present in the header of the packets that transport the x-client-data header over the internet. Suggesting that Google would ever use this value in isolation is almost insulting to Google; why would they waste their expensive developer time to create, capture, and manage data that is obviously useless?

      However, let's say they did make an isolated system that only ever received 13-bit integers stripped of all other data. Surely that wouldn't be personally identifiable? If they store it with a locally generated high-resolution timestamp, they can re-associate the data with personal accounts by correlating the timestamps with their other timestamped databases (web server access logs, GA, reCAPTCHA, etc.).
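      A toy sketch of that timestamp-correlation attack (all field names and records here are made up):

```python
from datetime import datetime, timedelta

# Hypothetical "isolated" store: 13-bit IDs with high-resolution timestamps.
client_data_log = [
    {"xcd": 4821, "ts": datetime(2020, 2, 5, 12, 0, 0, 137442)},
]

# Hypothetical web-server access log sharing the same clock.
access_log = [
    {"ip": "203.0.113.7", "account": "alice",
     "ts": datetime(2020, 2, 5, 12, 0, 0, 137001)},
    {"ip": "198.51.100.9", "account": "bob",
     "ts": datetime(2020, 2, 5, 14, 30, 0, 0)},
]

def correlate(xcd_rows, access_rows, window_us=1000):
    # Join the two logs on timestamp proximity: records that hit the
    # servers within the same millisecond likely belong to one request.
    window = timedelta(microseconds=window_us)
    matches = []
    for x in xcd_rows:
        for a in access_rows:
            if abs(x["ts"] - a["ts"]) <= window:
                matches.append((x["xcd"], a["account"], a["ip"]))
    return matches

print(correlate(client_data_log, access_log))
# → [(4821, 'alice', '203.0.113.7')]
# The "anonymous" 13-bit ID is re-linked to alice's account and IP.
```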

      > you'd need to describe a scheme such that, given an x-client-data header, and only an x-client-data header, you could identify one (and only one) unique person to whom that header corresponds

      You should first describe why Google would ever use that header and only that header. Even if they aren't currently using x-client-data as an identifier or as additional fingerprintable entropy, simply saving the data gives Google the option to use it as an identifier in the future.

      [1] https://www.youtube.com/watch?v=pT19VwBAqKA https://en.wikipedia.org/wiki/Differential_privacy
