In AI training data, underrepresented populations face greater privacy risks
Underrepresented patient populations already face plenty of problems with the healthcare system, and the AI era might be adding one more. New research published in Nature reveals that patients who make unique clinical and demographic contributions to AI training data might be easier to unmask and re-identify.
When subjected to membership inference attacks (MIAs), people from underrepresented groups often faced higher privacy risks than other populations because their records were more distinctive within the training data and easier to trace back to a specific individual.
The study adds urgency to the emerging concept of “privacy equity,” alongside clinical bias, as stakeholders work to bring more inclusive and representative data into the AI ecosystem.
What the study found
The study, conducted by researchers in Germany, examined whether an attacker could determine if a specific person’s data had been included in a model’s training dataset.
The researchers tested six common healthcare data types: chest X-rays, mammograms, retinal images, dermatology images, electrocardiograms, and EHR data. They found that some individual patient records were highly vulnerable to membership inference attacks, with certain attacks achieving near-perfect success at determining whether those records had been included in a model’s training data.
Interestingly, the largest models were even more vulnerable than smaller ones, with the share of patients with near-perfect attack success rising from 1 in 10,000 in the smallest datasets to 1 in 10 in the largest models tested.
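To make the idea of a membership inference attack concrete, here is a minimal sketch of one simple, generic variant: a loss-threshold attack. It is not the method used in the study, which the article does not describe in detail. It assumes the attacker can obtain a model's per-record loss and guesses "member" when that loss is unusually low. The function names, the threshold, and all loss values are hypothetical and purely illustrative.

```python
import numpy as np

def loss_threshold_attack(losses, threshold):
    """Guess 'member' (True) for every record whose loss is below the threshold."""
    return np.asarray(losses) < threshold

def attack_rates(guesses, is_member):
    """Return (true positive rate, false positive rate) of the membership guesses."""
    guesses = np.asarray(guesses, dtype=bool)
    is_member = np.asarray(is_member, dtype=bool)
    tpr = guesses[is_member].mean()    # members correctly flagged
    fpr = guesses[~is_member].mean()   # non-members wrongly flagged
    return float(tpr), float(fpr)

# Hypothetical per-record losses (illustrative numbers, not study data).
member_losses = [0.02, 0.05, 0.01, 0.40]     # records seen during training
nonmember_losses = [0.35, 0.60, 0.08, 0.90]  # records the model never saw
losses = member_losses + nonmember_losses
is_member = [True] * 4 + [False] * 4

guesses = loss_threshold_attack(losses, threshold=0.10)
tpr, fpr = attack_rates(guesses, is_member)
print(f"TPR={tpr:.2f}, FPR={fpr:.2f}")
```

The intuition is that models tend to fit records they were trained on more closely, so a distinctive record with a very low loss stands out as a likely member.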
The team also found that the privacy risks associated with re-identification were not distributed equally across patient populations. Individuals from certain patient subgroups, including groups defined by race, insurance status and disease status, were disproportionately represented among the records at highest risk of successful membership inference attacks.
In one emergency department dataset, for example, records from Black patients were 31% more common among the highest-risk records than would be expected based on their representation in the overall dataset. Patients covered by Medicaid were 126% more common in the highest-risk group, while patients with cancer were 18% more common than expected.
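The "more common than expected" figures above can be read as a ratio of two shares: a group's share of the highest-risk records divided by its share of the overall dataset. The sketch below works through that arithmetic. The article does not report the underlying shares, so the inputs are hypothetical values chosen only to reproduce the quoted percentages, and the function name is an assumption.

```python
def excess_representation_pct(share_in_high_risk, share_overall):
    """Percent by which a group is over-represented among high-risk records."""
    if share_overall <= 0:
        raise ValueError("share_overall must be positive")
    return (share_in_high_risk / share_overall - 1.0) * 100.0

# Hypothetical shares (not reported in the article): a group making up 20% of
# the dataset but 26.2% of the highest-risk records is 31% over-represented.
black_pct = excess_representation_pct(share_in_high_risk=0.262, share_overall=0.20)

# Likewise, 10% of the dataset but 22.6% of the highest-risk records is 126%.
medicaid_pct = excess_representation_pct(share_in_high_risk=0.226, share_overall=0.10)

print(round(black_pct), round(medicaid_pct))
```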
As a result, current methods of evaluating the privacy risk of contributing data to train AI models, which tend to assess aggregate risk, might not accurately capture the risks for individuals, especially those with less common clinical or demographic features.
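A small sketch can show how an aggregate figure may hide concentrated risk. It assumes an evaluator already knows, for each true training-set member, which subgroup the record belongs to and whether an attack flagged it. The group labels, counts, and function name are hypothetical and do not come from the study.

```python
from collections import defaultdict

def attack_success_by_group(records):
    """Fraction of true training-set members the attack flags, overall and per group.

    Each record is a (group_label, flagged_as_member) pair for a true member.
    """
    hits, totals = defaultdict(int), defaultdict(int)
    for group, flagged in records:
        totals[group] += 1
        hits[group] += int(flagged)
    overall = sum(hits.values()) / sum(totals.values())
    per_group = {group: hits[group] / totals[group] for group in totals}
    return overall, per_group

# Hypothetical outcome: 8 members from a large group, 2 from a small one.
records = (
    [("majority", False)] * 7
    + [("majority", True)]
    + [("minority", True)] * 2
)
overall, per_group = attack_success_by_group(records)
print(overall, per_group)
```

In this toy example the overall success rate looks modest, while every record in the smaller group is exposed, which is the kind of gap an aggregate-only assessment would miss.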
Why it matters for AI governance, privacy, and consent
The findings highlight a fundamental tension in training AI models on patient data. Increased representation is essential for improving accuracy, avoiding bias, and promoting health equity. But if participation exposes members of underrepresented groups to greater privacy risks, it may become harder to build trust and secure consent among those populations, further exacerbating the inequities these groups face.
To ensure that healthcare AI developers can retain and expand access to data from underrepresented populations, they must build “privacy equity” into their AI governance frameworks.
The authors of the study suggest several key actions for operationalizing this concept, including evaluating models for privacy leakage before deployment. Developers shouldn’t assume that standard de-identification techniques automatically protect patient privacy. Instead, models should be tested specifically for susceptibility to MIAs as a core part of the development and validation process.
Privacy should also be treated as a continuous risk management issue rather than addressed only through the lens of consent or HIPAA compliance. Routinely assessing privacy risks at every point in the AI lifecycle can help developers and users address the key question raised by the study: who really bears the risks when models leak information?
If underrepresented groups are disproportionately affected by risks from privacy attacks, that inequity must be mitigated before deployment, and any new risks that crop up as AI evolves need to be proactively identified as they appear.
By better understanding the privacy factors that might affect different populations contributing data to the AI environment, healthcare organizations can develop stronger governance frameworks and stay one step ahead of newly identified risks and barriers that may disproportionately impact certain groups.
Jennifer Bresnick is a journalist and freelance content creator with a decade of experience in the health IT industry. Her work has focused on leveraging innovative technology tools to create value, improve health equity, and achieve the promises of the learning health system.