How machine learning models might be failing our most vulnerable patients
Imagine walking into a doctor’s office at 65 years old, experiencing chest pain, only to have an AI system systematically underestimate your risk of heart disease. Or picture being a young woman with cardiac symptoms, facing near-certain misdiagnosis from the very technology meant to save lives.
This isn’t science fiction—it’s the reality I uncovered in my recent research on age bias in machine learning models for heart disease prediction. While the AI community has made significant strides in addressing racial and gender bias, we’ve largely overlooked a demographic factor that affects us all: age.
Working with my colleagues at the University of Nottingham, I analyzed the widely used UCI Heart Disease dataset to understand how ML models perform across different age groups. The results were striking:
The disparities I uncovered were both systematic and severe. Male patients aged 60 and older experienced false-negative rates 20% higher than those of younger males, despite having significantly higher actual disease prevalence. In other words, the very population most at risk of heart disease was being systematically underdiagnosed by the AI systems designed to help them. Even more concerning was the near-complete diagnostic failure for young women, with false-negative rates approaching 100% across all models tested. These women were essentially invisible to the algorithms, their symptoms dismissed by systems that had learned to associate heart disease primarily with older male patients. At the same time, older men also received the highest false-alarm rates, so the group already suffering missed diagnoses bore unnecessary anxiety and healthcare costs from overdiagnosis as well.
However, here’s the kicker: when we examined the overall model performance, everything appeared to be fine. Standard accuracy metrics were consistently above 70% across all models. The bias was hiding in plain sight, masked by aggregate statistics.
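To make that concrete, here is a minimal sketch of an age- and sex-disaggregated audit, assuming the data sit in a pandas DataFrame with age, sex, target, and pred columns (illustrative names, not the study's actual pipeline). A single accuracy number can look reassuring while the per-group false-negative rates tell a very different story.

```python
import pandas as pd
from sklearn.metrics import accuracy_score, confusion_matrix

def false_negative_rate(y_true, y_pred):
    """FNR = FN / (FN + TP): the share of true disease cases the model misses."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return fn / (fn + tp) if (fn + tp) > 0 else float("nan")

def disaggregated_report(df):
    """Contrast one aggregate accuracy figure with per-group error rates."""
    print(f"Overall accuracy: {accuracy_score(df['target'], df['pred']):.2f}")
    # Age bands are illustrative; the study's exact grouping may differ.
    banded = df.assign(age_band=pd.cut(df["age"], bins=[0, 45, 60, 120],
                                       labels=["<45", "45-59", "60+"]))
    for (band, sex), g in banded.groupby(["age_band", "sex"], observed=True):
        fnr = false_negative_rate(g["target"], g["pred"])
        print(f"age {band}, sex {sex}: n={len(g)}, FNR={fnr:.2f}")
```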
To understand what these statistical disparities mean in practice, I developed a clinical utility analysis that weights the real-world impact of different prediction errors. This approach recognizes that missing a heart disease diagnosis carries a much higher cost than generating a false alarm. The results painted a stark picture of healthcare inequality embedded within our AI systems. Middle-aged women consistently showed negative clinical utility across all models, meaning the AI was causing more harm than good for this demographic. These women would have been better off with no AI assistance at all. In sharp contrast, older males received clinical benefits six times higher than those of other groups, suggesting that the models had inadvertently optimized their performance for this specific population. Perhaps most troubling was the discovery that optimal decision thresholds varied dramatically by demographic group, indicating that a one-size-fits-all approach to AI deployment could systematically disadvantage certain patient populations.
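The core of that utility calculation is simple enough to sketch in a few lines. The weights below, with a missed diagnosis costing five times a false alarm, are an illustrative assumption rather than the calibrated values from the study, and the function is a reading of the general idea, not the actual analysis code.

```python
import numpy as np

def clinical_utility(y_true, y_prob, threshold=0.5,
                     benefit_tp=1.0, cost_fn=5.0, cost_fp=1.0):
    """Per-patient utility: reward detected cases, penalise misses more than false alarms.
    The weights here are illustrative assumptions, not the study's calibrated costs."""
    y_true = np.asarray(y_true)
    y_pred = (np.asarray(y_prob) >= threshold).astype(int)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    return (benefit_tp * tp - cost_fn * fn - cost_fp * fp) / len(y_true)
```

A negative value for a subgroup means that, under these assumed costs, patients in that group would be better served by no AI recommendation at all, which is exactly the situation described above for middle-aged women.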
Using SHAP (SHapley Additive exPlanations) analysis, I peered inside these “black box” models to understand the mechanisms driving these disparities. This explainability technique revealed that calcium scores and thallium stress test results dominated predictions across all models, but their influence varied significantly between demographic groups. The analysis uncovered complex patterns where the same clinical features carried different predictive weight depending on a patient’s age and sex. Feature impacts were distributed differently across algorithms, with logistic regression spreading influence more evenly while tree-based models concentrated impact in fewer features. Most importantly, the analysis revealed subtle age-sex interactions that played crucial roles in prediction outcomes but weren’t immediately obvious from traditional model evaluation. This transparency was essential because understanding why bias exists is the first step toward eliminating it.
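For readers who want to run a similar analysis, the sketch below shows one common way to compare attributions across demographic groups with the shap library. It assumes a fitted XGBoost classifier, a feature DataFrame X, and a Series of group labels aligned to X; it is not the code used in the study.

```python
import pandas as pd
import shap

def shap_importance_by_group(model, X, group_labels):
    """Mean absolute SHAP value per feature, computed separately for each group."""
    explainer = shap.TreeExplainer(model)      # suitable for XGBoost / tree ensembles
    shap_values = explainer.shap_values(X)     # shape: (n_samples, n_features)
    shap_df = pd.DataFrame(shap_values, columns=X.columns, index=X.index)
    # Rows: demographic group (e.g. "female_<45"); columns: features such as the
    # calcium score or thallium result; values: average attribution magnitude.
    return shap_df.abs().groupby(group_labels).mean()
```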
Implementing state-of-the-art fairness techniques proved to be a humbling experience that highlighted the limitations of current bias mitigation approaches. I deployed Fair-XGBoost with demographic parity constraints, threshold optimization techniques tailored for different groups, and balanced class weighting across all models. While these interventions showed modest improvements in some metrics, they failed to significantly reduce the most concerning disparities, particularly the severe underdiagnosis affecting younger women. The Fair-XGBoost model, despite being specifically designed to address fairness concerns, didn’t consistently outperform standard XGBoost across all fairness metrics. This finding suggests that current fairness methods aren’t sophisticated enough to handle the complex intersectional effects of age and sex in medical data, where multiple demographic factors interact in ways that simple constraint-based approaches cannot adequately address.
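To give a flavour of the simpler interventions, the sketch below pairs XGBoost's built-in class weighting with a per-group threshold search driven by the utility function sketched earlier. It is a plain illustration of group-specific threshold optimization under assumed costs, not the Fair-XGBoost implementation evaluated in the paper.

```python
import numpy as np
from xgboost import XGBClassifier

def fit_with_balanced_weights(X_train, y_train):
    """Balance the classes via scale_pos_weight (ratio of negatives to positives)."""
    pos_weight = (y_train == 0).sum() / max((y_train == 1).sum(), 1)
    model = XGBClassifier(scale_pos_weight=pos_weight, eval_metric="logloss")
    model.fit(X_train, y_train)
    return model

def per_group_thresholds(y_true, y_prob, groups, utility_fn):
    """For each demographic group, choose the decision threshold that maximises
    utility_fn (e.g. clinical_utility above) on held-out validation data."""
    thresholds = {}
    grid = np.linspace(0.05, 0.95, 19)
    for g in np.unique(groups):
        mask = groups == g
        scores = [utility_fn(y_true[mask], y_prob[mask], threshold=t) for t in grid]
        thresholds[g] = float(grid[int(np.argmax(scores))])
    return thresholds
```

Group-specific thresholds like these are easy to prototype, but as noted above they produced only modest gains in practice and left the largest disparities largely intact.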
As healthcare systems worldwide rush to adopt AI-driven diagnostic tools, these findings raise profound questions about the future of medical care. We may be inadvertently creating a two-tier healthcare system where AI works exceptionally well for some patients while systematically failing others. The challenge lies in balancing overall model performance with fairness across demographic groups, particularly when improving fairness for one group might come at the cost of accuracy for another. This research suggests we need entirely new regulatory frameworks to ensure AI doesn’t amplify existing healthcare disparities, moving beyond traditional accuracy metrics to incorporate measures of fairness and equity as core requirements for clinical deployment.
This research illuminates several urgent needs across the AI and healthcare communities. For researchers, the findings underscore the necessity of mandatory age-disaggregated fairness audits for all clinical AI systems, moving beyond aggregate performance metrics to examine how models behave across specific demographic groups. We need intersectional analysis that considers multiple demographic factors simultaneously, recognizing that the combination of age and sex creates unique patterns of bias that neither factor alone would reveal. Additionally, the development of fairness metrics that account for the different costs of various prediction errors is crucial, as a false negative in heart disease diagnosis carries far greater consequences than a false positive.
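One way to express that last point is as a cost-weighted disparity score: the gap in average error cost between the best- and worst-served intersectional group. The sketch below reuses the same assumed 5:1 miss-to-false-alarm cost ratio and is an illustration of the idea rather than a metric proposed in the paper.

```python
import numpy as np

def cost_weighted_error(y_true, y_pred, cost_fn=5.0, cost_fp=1.0):
    """Average error cost per patient, weighting misses above false alarms."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    fn = np.sum((y_pred == 0) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    return (cost_fn * fn + cost_fp * fp) / len(y_true)

def cost_weighted_disparity(y_true, y_pred, groups):
    """Gap between the worst- and best-served group (groups can encode age x sex)."""
    y_true, y_pred, groups = map(np.asarray, (y_true, y_pred, groups))
    costs = [cost_weighted_error(y_true[groups == g], y_pred[groups == g])
             for g in np.unique(groups)]
    return max(costs) - min(costs)
```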
For practitioners deploying these systems, the research highlights the critical importance of threshold calibration based on demographic groups and clinical context. The finding that optimal decision thresholds varied dramatically across groups suggests that one-size-fits-all deployment strategies may be fundamentally flawed. Healthcare institutions need transparency requirements that make model decisions interpretable to clinicians, allowing them to understand and potentially override AI recommendations when appropriate. Continuous monitoring for bias drift as models encounter new patient populations is equally essential, as bias patterns may evolve over time.
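Monitoring for bias drift need not require heavy infrastructure. A minimal sketch, assuming a prediction log with month, group, y_true, and y_pred columns and an arbitrary alert tolerance, could simply track the subgroup false-negative-rate gap over time and flag the windows that need clinical review:

```python
import pandas as pd

ALERT_GAP = 0.15  # assumed tolerance for the gap in false-negative rates between groups

def monitor_bias_drift(log: pd.DataFrame) -> pd.Series:
    """Flag months where the spread of subgroup false-negative rates exceeds ALERT_GAP.
    `log` is a prediction log with columns: month, group, y_true, y_pred."""
    rows = []
    for (month, group), g in log.groupby(["month", "group"]):
        positives = g[g["y_true"] == 1]
        fnr = float("nan") if positives.empty else (positives["y_pred"] == 0).mean()
        rows.append({"month": month, "group": group, "fnr": fnr})
    monthly = pd.DataFrame(rows).pivot(index="month", columns="group", values="fnr")
    gap = monthly.max(axis=1) - monthly.min(axis=1)
    return gap[gap > ALERT_GAP]  # months that warrant clinical review
```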
Policymakers face perhaps the greatest challenge in developing regulatory frameworks that require fairness testing before clinical deployment. These frameworks must establish standards for demographic representation in training datasets and provide clear guidelines for handling age-related and intersectional bias. The complexity of these issues suggests that traditional medical device approval processes may be inadequate for AI systems that exhibit such nuanced patterns of demographic bias.
This work represents just the beginning of a much larger conversation about fairness in clinical AI. The dataset I analyzed was relatively small and drawn from a single medical center in the 1980s, highlighting the need for validation in contemporary, diverse patient cohorts that better represent today’s healthcare landscape. Future research must expand beyond age and sex to include race, ethnicity, and socioeconomic factors, recognizing that bias operates across multiple intersecting dimensions. Most critically, we need prospective clinical trials of fairness-enhanced models to understand how these systems perform in real-world clinical settings, where the stakes are measured not in accuracy percentages but in human lives and wellbeing.
Machine learning has immense potential to revolutionise healthcare, but only if we ensure it works fairly for all patients. Age bias in AI isn’t just a technical problem—it’s a healthcare equity issue that could affect millions of people as they age. As we stand on the brink of an AI-driven healthcare revolution, we have a choice: We can deploy these systems as they are, perpetuating and potentially amplifying existing disparities, or we can do the hard work of making them fair. The stakes couldn’t be higher. After all, if we’re lucky, we’ll all be old someday.
This research was conducted at the University of Nottingham School of Computer Science in collaboration with Xinjie Wu, George Oteng, and Lingyu Fan. The full paper is available upon request.
Obed Johnson