Comment by LatencyKills
12 hours ago
Couldn't this be used to locate private data in unstructured text without having to rely on other means of PII detection?
1. Pass the raw text through the filter to obtain the spans.
2. Map all the spans back to the original text.
Now you have all the PII information.
Yep, and already has been done.
https://github.com/chiefautism/privacy-parser
If you have the redacted and unredacted versions, then you can diff them; that seems unsurprising? Unless I'm really misunderstanding "spans"?
> If you have the redacted and unredacted versions, then you can diff them; that seems unsurprising?
I'm suggesting that a model designed for high-accuracy redaction can also be used to find all PII in unredacted text. For example, if I don't already know how to find PII (e.g., regex, NLP, etc.) I can use OpenAI's Privacy Filter model to do the work for me.
And because each span has a type (PRIVATE_NAME, etc.) I don't even need to do any work to find only the specific information I am looking for; something that simple diffing wouldn't do.
I'm not saying it's an issue, I just think it is interesting that a tool designed to protect PII can also be used to find it with minimal effort. And it looks like someone already implemented it: https://github.com/chiefautism/privacy-parser.