← Back to context

Comment by adelie

9 months ago

my understanding is that there's a bit of a catch-22 with data removal - if you request that a data broker remove ALL of your information, it's impossible for them to keep you from reappearing in their sources later on because that would require them to retain your information (so they can filter you out if you appear again).

I’ve heard this claim, but they could use some sort of bloom filter pr cryptographic hashing to block profiles that contain previously-removed records.

There could also be a shared, trusted opt-out service that accepted information and returned a boolean saying “opt-out” or “opt-in”.

Ideally, it’d return “opt-out” in the no-information case.

  • Hash-based solutions aren't as easy as we might hope.

    You store a hashed version of my SSN, or my phone number, to represent my opt-out? Someone can just hash every number from 000-00-0000 to 999-99-9999 and figure out mine from that.

    You hash the entire contents of the profile - name+address+phone+e-mail+DOB+SSN - and the moment a data source provides them with a profile only containing name+address+email - the missing fields mean the hashes won't match.

    A trusted third party will work a lot better IMHO.

    And of course none of the data brokers have much reason to make opt-outs work well, in the absence of legislation and strict enforcement - it's in their commercial interests to say they "can't stop your data reappearing"

    • > Someone can just hash every number from 000-00-0000 to 999-99-9999 and figure out mine from that.

      That's what salts are for, right? It wouldn't be too hard to issue a very large, known, public salt alongside each SSN.

      > And of course none of the data brokers have much reason to make opt-outs work well, in the absence of legislation and strict enforcement - it's in their commercial interests to say they "can't stop your data reappearing"

      This is the actual reason, IMHO.

      9 replies →

  • Yup, it would be trivial to create a one way hash of various attributes to perma ‘opt out’ someone.

    But how would they keep making money that way?

  • So for a perfect match they'd need to have some sort of unique identifier that's present in the first set of data you ask them to remove, as well as being present in any subsequent "acquisitions" or "scrapes" of your data.

    If these devs that scrape/dump/collate all this info are anything like the ones I've seen, and they're functioning in countries like the US and UK whereby you don't have individual identifiers that are pretty unique, then I'd say the chance of them being able to get such a "unique" key on you to remove you perpetually, is next to impossible. And if it's even close to being "hard", they'll not even bother. Doubley-so if this service/people/data is anything like the credit-score companies, which are notoriously bad at data de duplication and sanitation.

    Likewise, if you want them to do some sort of removal using things other than a unique identifier, then you have to have some sort of function that determines closeness between the two records. From what I've heard, places like Interpol, countries' border-control and police agencies usually use name, surname and dob as a combination to match. Amazingly unique and unchanging combination, that one! /s

Sorry, I value my legal rights over the viability of the data broker industry. If they can’t figure out a way for lawfully not collecting my data, they should not collect data period.

  • I mean, if we’re not allowed to know that we’re not allowed to surveil the shit out of you, it seems like something we can’t worry about

1. They could be required to store a private copy of the removal requests, data that they can't sell (not ideal)

2. Sounds like "data brokers" that sell private information just shouldn't exist...

  • > They could be required to store a private copy of the removal requests

    They would leak that in the next data breach.

They could store a hash.

  • Which would never work because real life data is messy so the hashes would not match. Even something as simple as SSN + DOB runs into loads of potential formatting and data entry issues you'll have to perfectly solve before such a system could work, and even that makes assumptions as to what data will be available from each dataset. Some may be only name and address. Some may include DoB, but the person might have lied about their DoB when filling out the form. The people entering it might have misspelled their name. It might be a person who put in a fake SSN because they're an illegal immigrant without a real one. Data correlation in the real world is a nightmare.

    When you tell a data broker to delete all of the data about you, how can you be sure they get ALL of the data about you, including the ones where your name is misspelled or the DoB is wrong or it lists and old address or something? Even worse if someone comes around later and discovers the orphan data when adding new data about you and fixes the glitch, effectively undoing the data delete.

    It's a catch-22 that if you want them to not collect data about you they need a full profile on you in order to be able to reject new data. A profile that they will need to keep up-to-date, which is what they were doing already.

    • > Even something as simple as SSN + DOB runs into loads of potential formatting and data entry issues you'll have to perfectly solve

      You don’t have to solve it perfectly to be an improvement.

      Also this is BS. Not every bit of data is perfectly formatted and structured but both of your examples are structured data. You can 100% reliably and deterministically hash this data.

      There’s so much in your argument that can be replied with “imperfect is better than status quo”. If you give someone the wrong DOB, it’s “not you” anyways, at least let me scrub my real data even if the entry is imperfect for some people or some records.

      5 replies →

    • There's a trivial way to not re-add data that was removed: don't do it without user opt-in, whom admittedly you have access to ask at the moment of data collection. If you don't have the ability to ask users to opt in, you probably shouldn't be collecting the data anyway, with very few exceptions like criminal records.

      edit for clarity: by criminal records, I mean for the official management of them, not for scraping their content.