
Comment by joshuamorton

6 years ago

> You need to do something like aggregate the data into a small number of groups such that individual records no longer exist. Techniques like "differential privacy" let you control precisely how "anonymous" your data is by e.g. mixing in carefully crafted noise.

Correct, and another anonymization technique (in place of differential privacy) is k-anonymity. In a k-anonymity scheme, you ensure that in any given table no row corresponds to fewer than k individuals. Why is this useful? Well, let's say you have some 10-15 bit identifier. A request from a user can contain information that might, when combined, be identifying: coarse-ish location (state/country), device metadata (browser version, OS version), and coarse access time (hour and day of week). Combining all 3 (or 4 if you include the pseudonymous ID) is enough to uniquely identify at least some users. Then let's say you also track some performance statistics about the browser itself.
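As a minimal sketch of the suppression step (the field names, values, and k are all made up for illustration): drop any row whose quasi-identifier combination is shared by fewer than k rows, so every surviving row blends in with at least k-1 others.

```python
from collections import Counter

def enforce_k_anonymity(rows, quasi_ids, k):
    """Keep only rows whose quasi-identifier combination occurs
    at least k times, so each surviving row matches >= k individuals."""
    counts = Counter(tuple(row[q] for q in quasi_ids) for row in rows)
    return [row for row in rows
            if counts[tuple(row[q] for q in quasi_ids)] >= k]

rows = [
    {"state": "CA", "browser": "79.0", "hour": 14},
    {"state": "CA", "browser": "79.0", "hour": 14},
    {"state": "CA", "browser": "79.0", "hour": 14},
    {"state": "NY", "browser": "78.0", "hour": 9},  # unique combination
]
safe = enforce_k_anonymity(rows, ["state", "browser", "hour"], k=3)
# the lone NY row is suppressed; the three matching CA rows survive
```

Real systems would generalize values (e.g. bucket hours into coarser ranges) before suppressing, but the invariant being enforced is the same.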

But any single piece of data (plus the pseudonymous ID) is not enough to identify any specific user. So if you use the pseudonymous ID as a shared foreign key, you can join across the tables and get approximate crosstabs without uniquely identifying anyone. Essentially, if you want to ask whether there are performance differences between version N and version N+1, you can compare aggregate performance against the aggregate count of new vs. old browser versions, and with 8K samples you're able to draw reasonable conclusions. And in general you can do this across dimensions, or combinations of dimensions, that might normally contain enough pieces of information to identify a single user.
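A toy sketch of that join-then-aggregate pattern (the IDs, versions, and timings are invented): each table holds only one quasi-identifier per pseudonymous ID, and only aggregates ever leave the join.

```python
from statistics import mean

# Table A: pseudonymous ID -> browser version (one quasi-identifier)
versions = {"a1": "N", "a2": "N", "a3": "N+1", "a4": "N+1"}
# Table B: pseudonymous ID -> a performance sample (page load, ms)
perf = {"a1": 120, "a2": 130, "a3": 95, "a4": 105}

# Join on the shared pseudonymous ID, then release only per-version aggregates
by_version = {}
for pid, v in versions.items():
    by_version.setdefault(v, []).append(perf[pid])

summary = {v: (len(samples), mean(samples))
           for v, samples in by_version.items()}
# summary -> {"N": (2, 125), "N+1": (2, 100)}: counts and means, no raw rows
```

With realistic sample sizes (the 8K above), those per-version means support the "did N+1 regress?" question without any single user's row being exposed.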

This is essentially the same idea as differential privacy, although without the same mathematical precision that differential privacy can provide. (By this I don't mean that the data can be re-identified, just that differential privacy can be used to provide tighter bounds on the anonymization, such that the statistical inferences you can gather are more precise. k-anonymity is, perhaps, a less mathematically elegant tool).
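To make the "mathematical precision" contrast concrete: differential privacy's guarantee comes from calibrated noise, e.g. the Laplace mechanism for counts. A sketch (not any particular production implementation; epsilon and the query are illustrative):

```python
import random

def noisy_count(true_count, epsilon, sensitivity=1.0):
    """Differentially private count: add Laplace(sensitivity/epsilon) noise.
    Smaller epsilon -> more noise -> a formally stronger privacy guarantee.
    The difference of two exponentials with mean `scale` is Laplace(scale)."""
    scale = sensitivity / epsilon
    noise = random.expovariate(1 / scale) - random.expovariate(1 / scale)
    return true_count + noise

random.seed(0)
released = noisy_count(8000, epsilon=1.0)  # e.g. "how many users on N+1?"
```

The point is that epsilon gives you a tunable, provable bound on what any one record can change, whereas k only bounds group sizes.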

Specifically, I'm describing k-anonymity using x-client-data as a quasi-identifier in place of something like an IP or MAC address. You can find those terms in the "See Also" section of the differential privacy wiki page you linked. Google is mentioned in those pages as a known user of both differential privacy and k-anonymization in other tools.

Hopefully that answers your question of why Google would want such a thing.

> simply saving the data gives Google the option to use it as an identifier in the future.

Yes, but that doesn't mean that they're currently in violation of the GDPR, which is what a number of people keep insisting. I'm not claiming that it's impossible for Google to be doing something nefarious with data (although I will say that in general I think that's an unreasonably high bar). Just that the collection of something like this isn't an indication of nefarious actions, and is in fact likely the opposite.