Comment by yjftsjthsd-h
9 hours ago
If you have the redacted and unredacted versions, then you can diff them; that seems unsurprising? Unless I'm really misunderstanding "spans"?
> If you have the redacted and unredacted versions, then you can diff them; that seems unsurprising?
I'm suggesting that a model designed for high-accuracy redaction can also be used to find all the PII in unredacted text. For example, if I don't already know how to find PII myself (with regex, NLP, or some other technique), I can use OpenAI's Privacy Filter model to do the work for me.
And because each span has a type (PRIVATE_NAME, etc.), I don't even need to do any extra work to pull out only the specific information I'm looking for, which is something a simple diff wouldn't give me.
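To make the point about typed spans concrete, here is a minimal sketch. It does not call the real model: the `Span` class, the `extract` helper, the hand-written offsets, and the PRIVATE_EMAIL label are all my own assumptions for illustration (only PRIVATE_NAME comes from the discussion above). It just shows how little code sits between "spans with labels" and "only the values I care about".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """Hypothetical shape of one detected span: character offsets plus a type label."""
    start: int
    end: int
    label: str


def extract(text: str, spans: list[Span], wanted: set[str]) -> list[str]:
    """Return the substrings covered by spans whose label is in `wanted`."""
    return [text[s.start:s.end] for s in spans if s.label in wanted]


# Hand-written stand-in for model output; a real detector would produce these.
text = "Contact Jane Doe at jane@example.com."
spans = [
    Span(8, 16, "PRIVATE_NAME"),    # covers "Jane Doe"
    Span(20, 36, "PRIVATE_EMAIL"),  # assumed label name, covers the email address
]

names_only = extract(text, spans, {"PRIVATE_NAME"})
print(names_only)
```

A diff of redacted against unredacted text would only tell me where something was removed; the label is what lets the filter in `extract` pick out one category.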
I'm not saying it's an issue, I just think it is interesting that a tool designed to protect PII can also be used to find it with minimal effort. And it looks like someone already implemented it: https://github.com/chiefautism/privacy-parser.