Comment by riskable
12 hours ago
One of the best ways to handle this sort of thing is to put things like PII in a separate database entirely and replace it with a token in the "main" database. When something like PII actually needs to be retrieved you first retrieve the token and then search the other database for said token to get the real data.
It certainly complicates things but it provides an additional security layer of separation between the PII and it's related data. You can provide your end users access to a database without having to worry about them getting access to the "dangerous" data. If they do indeed need access to the data pointed to via the token they can request access to that related database.
This method also provides more performance since you don't need to encrypt the entire database (which is often required when storing PII) and also don't need to add extra security context function calls to every database request.
A nice tradeoff in many cases is to have a separate schema rather than a separate database, which allows preserving referential integrity and using the database’s RBAC to restrict access to the schemas. This also means things like cascading deletes can still work.
The is basically just a foreign database key which, in most cases, is not sufficient to satisfy industry and regulatory requirements for anonymization and storage of PII.
Eh. I get your point and truly appreciate structural safeguards as opposed to aspirational ones but this is really not as doable as you make it out to be, and doing it properly would be a full blown product in its own right. First, this only works if you have a very narrow interpretation of PII. Once you realize most of your non-int/uuid unique indexes (and all your join predicates) are probably PII in some way or the other, the scope of the problem greatly increases. How does your solution work when you need to group by PII, full text search by PII, filter by PII, etc?