ChatGPT Health shows worrisome “blind spots” in real-world clinical test cases

A study from Mount Sinai indicates ChatGPT Health shows lackluster performance on triage and urgent care advice for patients.

By admin

Mar 3, 2026, 9:38 AM

ChatGPT Health shouldn’t be replacing your triage nurses any time soon, according to a new study out of Mount Sinai Health System, published in Nature Medicine.

The consumer-facing AI healthcare tool, launched earlier this year to great fanfare from OpenAI, shows notable deficiencies in its ability to guide patients to appropriate care – especially when they’re facing likely clinical emergencies.

When researchers conducted a stress test of the platform with a series of clinician-authored vignettes, ChatGPT Health under-triaged more than half (52%) of cases, often directing patients to watch and wait rather than seek emergency care for situations that human clinicians would recognize as urgent.

The errors were most egregious at the tail ends of the clinical bell curve, the authors explained, with the most serious failures occurring in both non-urgent presentations and true emergencies.

“ChatGPT Health performed well in textbook emergencies such as stroke or severe allergic reactions,” said lead author Ashwin Ramaswamy, MD, Instructor of Urology at the Icahn School of Medicine at Mount Sinai, in a press release. “But it struggled in more nuanced situations where the danger is not immediately obvious, and those are often the cases where clinical judgment matters most. In one asthma scenario, for example, the system identified early warning signs of respiratory failure in its explanation but still advised waiting rather than seeking emergency treatment.” 

The system was more likely to advise patients to stay home when the user reported that family members or friends minimized the severity of their symptoms, showing a potentially dangerous people-pleasing bias in how it processed and interpreted user-generated information.

“LLMs have become patients’ first stop for medical advice—but in 2026 they are least safe at the clinical extremes, where judgment separates missed emergencies from needless alarm,” said Isaac S. Kohane, MD, PhD, Chair, Department of Biomedical Informatics at Harvard Medical School, who was not involved with the research. “When millions of people are using an AI system to decide whether they need emergency care, the stakes are extraordinarily high. Independent evaluation should be routine, not optional.” 

The research team pointed out one particularly troubling blind spot in the model’s engagement with users: an inconsistency in displaying suicide support resources like the 988 Suicide and Crisis Lifeline when users expressed self-harm intentions, despite the model being coded to present the resources in all relevant situations.

In reality, the alerts were sometimes triggered in lower-risk conversations while failing to appear in conversations where the user explicitly discussed a plan or intention to engage in self-harm.

“This was a particularly surprising and concerning finding,” said senior and co-corresponding study author Girish N. Nadkarni, MD, MPH, Chief AI Officer of the Mount Sinai Health System. “While we expected some variability, what we observed went beyond inconsistency. The system’s alerts were inverted relative to clinical risk, appearing more reliably for lower-risk scenarios than for cases when someone shared how they intended to hurt themselves. In real life, when someone talks about exactly how they would harm themselves, that’s a sign of more immediate and serious danger, not less.”

The results of the study indicate a need to thoroughly evaluate the performance, accuracy, and reliability of AI models before deploying them to patients, the authors said. And since AI platforms like ChatGPT Health are continually updated, it’s critical to make these evaluations both frequent and consistent.

The team plans to do just that in the near future, including exploring performance in other areas such as medication safety, pediatric care, and non-English use.

For now, clinicians and health system leaders should work to develop patient-facing education on the limitations of chatbots for triage, and ensure that experienced human clinicians are always available to provide advice and direct patients to the appropriate care setting with the right degree of urgency.

Jennifer Bresnick is a journalist and freelance content creator with a decade of experience in the health IT industry. Her work has focused on leveraging innovative technology tools to create value, improve health equity, and achieve the promises of the learning health system. She can be reached at [email protected].
