Comment by michaelt

15 hours ago

It's exceptionally difficult to avoid the data being de-anonymised.

If an 'anonymised' medical record says the person was born 6th September 1969, received treatment for a broken arm on 1 April 2004, and received a course of treatment in 2009 after catching the clap on holiday in Thailand - that's enough bits of information to uniquely identify me.

And medical researchers are usually very big on 'fully informed consent', so they can't gloss over that reality, hide it in fine print or obfuscate it with flowery language. They usually have to make sure the participants really understand what they're agreeing to.

It might still work out fine, of course - 95% of people's medical histories don't contain anything particularly embarrassing, so you might be able to get plenty of participants anyway.

... received a course of treatment in 2009 after catching the clap on holiday in Thailand

Yeah, sorry about that

In my experience with health data, the dates are usually offset by a random but constant amount for each person (e.g. id 12345 will have all their dates shifted by +5 weeks) to avoid identification by dates.

Unfortunately the sequence of treatments and locations is usually enough to identify someone, especially if it's a rarer condition.
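For anyone who hasn't seen the per-person date shift in practice, here is a rough Python sketch of the idea. It is not any particular project's implementation: deriving the offset from a keyed hash of the id, the 60-day bound, the function names and the visit dates are all assumptions made up for illustration. It also shows why the shift alone doesn't help much: the gaps between one person's visits come out unchanged.

```python
import hashlib
import hmac
from datetime import date, timedelta

MAX_SHIFT_DAYS = 60  # assumed bound; a real project would pick its own


def offset_for(person_id: str, key: bytes) -> timedelta:
    """Derive one stable, pseudo-random offset per person from a secret key."""
    digest = hmac.new(key, person_id.encode("utf-8"), hashlib.sha256).digest()
    n = int.from_bytes(digest[:8], "big")
    # Map the hash to a whole number of days in [-MAX_SHIFT_DAYS, +MAX_SHIFT_DAYS].
    return timedelta(days=n % (2 * MAX_SHIFT_DAYS + 1) - MAX_SHIFT_DAYS)


def shift_dates(person_id: str, dates: list[date], key: bytes) -> list[date]:
    """Shift every date for one person by that person's constant offset."""
    delta = offset_for(person_id, key)
    return [d + delta for d in dates]


def gaps(dates: list[date]) -> list[int]:
    """Days between consecutive dates."""
    return [(b - a).days for a, b in zip(dates, dates[1:])]


# Made-up visit dates for a hypothetical id 12345.
visits = [date(2004, 4, 1), date(2004, 4, 15), date(2009, 8, 3)]
shifted = shift_dates("12345", visits, key=b"example-secret")

# The absolute dates move, but the spacing between visits is preserved.
print(gaps(visits) == gaps(shifted))
```

Because every date for a person moves by the same amount, anyone who knows the spacing of a few visits can still look for that pattern in the shifted data.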

  • Location data is very readily available, so you can easily correlate visits to a health facility with a treatment, and even with an offset, you can probably uniquely identify someone with 4 visits depending on the size of the medical facility.

    • I had access to several health datasets for my research in the past. Date of birth was rarely given, especially for the bigger projects where there were more resources to allocate to privacy protection. Neither was date of death, location, or visits to a health facility with a treatment. Typically the relevant variables are age (in years), treatment type and possibly number of cycles. Probably insufficient to identify someone without access to hospital records. But if you have that, you have all these data anyways.

      Most researchers likely would want to summarize these data in a similar way anyway, so this works out nicely.
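One rough way to check whether coarse variables like age in years, treatment type and number of cycles are still identifying is to count how many records share each combination. The sketch below uses entirely made-up records and a hypothetical helper name; it only illustrates the counting, not how any real dataset was vetted.

```python
from collections import Counter

# Hypothetical records: (age in years, treatment type, number of cycles).
records = [
    (54, "chemo_A", 6),
    (54, "chemo_A", 6),
    (54, "chemo_A", 4),
    (61, "chemo_B", 3),
    (61, "chemo_B", 3),
    (37, "rare_therapy", 1),
]


def unique_records(rows):
    """Return the combinations that occur exactly once in the data."""
    counts = Counter(rows)
    return [row for row, n in counts.items() if n == 1]


singletons = unique_records(records)
print(f"{len(singletons)} of {len(records)} records are unique on these variables")
```

Records that end up alone in their group are the ones at risk, which fits the point above about rarer conditions being easier to pick out.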