Making patient data unidentifiable is not foolproof, researchers report.
Google researchers recently found that efforts to make patient information anonymous don't always work. The company has drawn criticism over reports that its employees had access to identifiable patient information. The company described its work in a paper recently published in the journal BMC Medical Informatics and Decision Making, finding that even its best efforts to de-identify health data would leave some people exposed.
Nineteen Google researchers used sample de-identified patient data to design an algorithm that could spot breast cancer in mammograms. Without any one person's identifiers, there is less risk of exposing sensitive information that could keep them from being gainfully employed or even being considered for certain loans.
The researchers found that, on average, "automated tools that use machine learning to comb through patient data only succeed at rendering 97% of it anonymous," according to two recent studies. "Humans doing the job manually are even worse, with one study pegging people's ability to hunt through data and find the patient identifiers that need to be removed as low as 81%."
“The one thing that’s always seemed to haunt the [health care] field is the question, ‘How good is good enough?’” said Leonard D’Avolio, an assistant professor of medicine at Harvard and the co-founder of a performance improvement startup called Cyft.
One area the researchers honed in on is "free text," or the medical notes clinicians take that cannot be auto-categorized, which often include patients' personal information, such as lifestyle factors and preferences, and the names of family members. "Reports on X-ray images and lab tests, emails, and phone calls can also be included as free text," the researchers reported.
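To see why free text is so hard to scrub, consider a minimal rule-based redaction sketch. The patterns and replacement tags below are illustrative assumptions, not the tools described in the Google paper; real notes are full of irregular names, nicknames, and context-dependent details that simple rules like these miss, which is part of why automated rates top out short of 100%.

```python
import re

# Illustrative patterns only -- not the Google tool's actual rules.
PATTERNS = [
    (re.compile(r"\b\d{3}-\d{3}-\d{4}\b"), "[PHONE]"),               # US phone numbers
    (re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b"), "[DATE]"),            # dates like 3/14/2019
    (re.compile(r"\b(?:Mr|Mrs|Ms|Dr)\.\s+[A-Z][a-z]+\b"), "[NAME]"), # titled names
]

def redact(note: str) -> str:
    """Replace identifier-like spans in a clinical note with placeholder tags."""
    for pattern, tag in PATTERNS:
        note = pattern.sub(tag, note)
    return note

note = "Dr. Smith saw the patient on 3/14/2019; callback at 617-555-0123."
print(redact(note))
# → [NAME] saw the patient on [DATE]; callback at [PHONE].
```

A note mentioning "the patient's sister Maria" or a date written as "March 14th" would slip straight through these rules, illustrating the gap that manual review or machine-learning models are meant to close.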
The Google researchers experimented with four different ways of de-identifying data. For some of the approaches, the company designed a tool to automate the de-identification work. In the end, even with their most extensive de-identification methods, the researchers anonymized "only 97% to 99% of the data." To D'Avolio, "This isn't good enough…None of the policy or institutional decision-makers I was speaking with at the time had a tolerance for 99% de-identified."
For health systems that have the resources to do so, Google recommended they “have humans label a subset of the data manually so they can partially customize tools to automate the de-identification process. Only 20 to 80 labeled samples,” they conclude in the paper, “are enough to make a customized tool perform slightly better than an existing one, and 1,000 labeled samples are enough to give results on par with a fully customized tool.”
Researchers from Belgium's Université catholique de Louvain (UCLouvain) and Imperial College London previously put together a model to estimate how easy it would be to deanonymize any arbitrary dataset. They found that "a dataset with 15 demographic attributes would render 99.98% of people in Massachusetts unique." And for a smaller dataset, "it would not take much to reidentify people living in Harwich Port, Massachusetts, a city of fewer than 2,000 inhabitants."
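The intuition behind that finding can be shown with a toy sketch: as more demographic attributes are combined, more records become one of a kind. The records below are fabricated for illustration and have no connection to the UCLouvain/Imperial College model.

```python
from collections import Counter

# Fabricated toy records: (ZIP code, birth year, sex, occupation)
records = [
    ("02646", 1954, "F", "teacher"),
    ("02646", 1954, "F", "nurse"),
    ("02646", 1987, "M", "teacher"),
    ("02139", 1987, "M", "teacher"),
    ("02139", 1954, "F", "teacher"),
    ("02646", 1987, "M", "fisherman"),
]

def unique_fraction(rows, num_attrs):
    """Fraction of rows uniquely identified by their first num_attrs attributes."""
    counts = Counter(row[:num_attrs] for row in rows)
    return sum(1 for row in rows if counts[row[:num_attrs]] == 1) / len(rows)

for k in range(1, 5):
    print(f"{k} attribute(s): {unique_fraction(records, k):.0%} of records unique")
```

With ZIP code alone, no one here is unique; with all four attributes, every record is. Scaled up to 15 attributes across a real population, near-total uniqueness is what the researchers reported.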
The study reported, "Moving forward, [we] question whether current deidentification practices satisfy the anonymization standards of modern data protection laws such as GDPR and CCPA [California Consumer Privacy Act] and emphasize the need to move, from a legal and regulatory perspective, beyond the deidentification release-and-forget model."