Comment by Cynddl
11 hours ago
I'm going to repeat myself, as I do every time I encounter such tools. These tools DO NOT provide anonymization, and especially not at the level required by the EU's GDPR (where the notion of PII does not exist).
As a computer scientist and academic researcher who has worked on this topic for more than a decade now (some of my work, if you are interested: [1, 2]), I can say that re-identification is often possible from just a few pieces of information. Masking or replacing a few values or columns will often not provide sufficient guarantees, especially when a lot of information is being released.
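To make the point about "few pieces of information" concrete, here is a minimal Python sketch. The records, column names, and the `unique_fraction` helper are all made up for illustration; it only measures how many rows remain unique on a combination of unmasked columns, and is not a real re-identification attack.

```python
from collections import Counter

# Hypothetical toy records: the name column is already masked,
# but zip code, birth year and sex are left untouched.
records = [
    {"name": "user_1", "zip": "75001", "birth_year": 1980, "sex": "F"},
    {"name": "user_2", "zip": "75001", "birth_year": 1980, "sex": "M"},
    {"name": "user_3", "zip": "75002", "birth_year": 1991, "sex": "F"},
    {"name": "user_4", "zip": "75002", "birth_year": 1991, "sex": "F"},
    {"name": "user_5", "zip": "75003", "birth_year": 1975, "sex": "M"},
]


def unique_fraction(rows, quasi_identifiers):
    """Fraction of rows whose combination of quasi-identifier values is unique."""
    keys = [tuple(row[q] for q in quasi_identifiers) for row in rows]
    counts = Counter(keys)
    unique = sum(1 for key in keys if counts[key] == 1)
    return unique / len(rows)


# One column alone singles out few rows; combining columns singles out more.
print(unique_fraction(records, ["zip"]))
print(unique_fraction(records, ["zip", "birth_year", "sex"]))
```

In this toy data, masking the name does nothing for the rows that are still unique on the remaining columns: anyone who knows those three attributes about a person can find their row.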
What this tool does is called ‘pseudonymization’ and maybe, if done very carefully, ‘de-identification’ in some cases. With colleagues, I reviewed all the literature and industry practices a few months ago [3], and our conclusion was:
> We find that, although no perfect solution exists, applying modern techniques while auditing their guarantees against attacks is the best approach to safely use and share data today.
This is clearly not what this tool is doing.
[1] https://www.nature.com/articles/s41467-019-10933-3
[2] https://www.nature.com/articles/s41467-024-55296-6
[3] https://www.science.org/doi/10.1126/sciadv.adn7053
Of course there's no perfect solution for anonymizing a dataset...
The extension offers a wide range of masking functions: some are pseudonymizing functions, but others are more destructive. For instance, there's a large collection of fake data generators (names, addresses, phone numbers, etc.).
It's up to the database administrator or the application developer to decide which columns need to be masked and how they should be masked.
In some use cases pseudonymization is enough; in others, anonymization is required.
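As a rough illustration of the difference between a pseudonymizing function and a more destructive one, here is a small Python sketch. It is not the extension's API: `pseudonymize`, `replace_with_fake`, `SECRET_KEY` and `FAKE_NAMES` are hypothetical names invented for this example.

```python
import hashlib
import hmac
import random

SECRET_KEY = b"hypothetical-secret-key"  # assumption: stored outside the dataset
FAKE_NAMES = ["Alice Martin", "Bob Durand", "Chloe Petit"]  # made-up pool


def pseudonymize(value: str) -> str:
    """Keyed hash: the same input always maps to the same token."""
    digest = hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


def replace_with_fake(value: str, rng: random.Random) -> str:
    """Destructive replacement: the output does not depend on the input."""
    return rng.choice(FAKE_NAMES)


rng = random.Random(0)
names = ["Jane Doe", "John Smith", "Jane Doe"]

# Pseudonyms keep the link between repeated values (rows stay joinable).
pseudonyms = [pseudonymize(n) for n in names]

# Fake values break that link, at the cost of losing the information.
fakes = [replace_with_fake(n, rng) for n in names]

print(pseudonyms)
print(fakes)
```

The pseudonymized column still lets you tell that rows 1 and 3 belong to the same person, which is useful for testing but is exactly why it is not anonymization. The fake column destroys that link, though the other columns in the row may still identify the person.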
Seems like if you're doing static masking and you mask enough data, this works just great. Am I missing something?