Are AI chatbots worth it when patients aren’t getting the right answers?

AI companies are rolling out health-specific chatbots, but patients aren’t very good at using them yet.

By admin

Feb 10, 2026, 2:23 PM

OpenAI, Anthropic, and Amazon all made headlines early this year for rolling out healthcare-specific toolkits for everyday consumers, proclaiming the start of a new era of personalized health education and empowered decision-making.

With broad, established usage of AI chatbots like ChatGPT and Claude for healthcare questions, including tens of millions of health-related queries every week, it makes sense to develop specific tools and experiences for these activities while putting additional privacy and security guardrails around them.

But a tool is only as valuable as its user. And new research published in Nature Medicine this month indicates that the average patient isn’t all that great at using large language models (LLMs) to guide their healthcare decision-making just yet.

The study of nearly 1,300 patients in the United Kingdom found that while LLMs were nearly perfect at identifying health conditions, and fairly accomplished at suggesting the correct action for someone with that condition, average human users were dramatically worse at both tasks.

Furthermore, outcomes for participants using the AI chatbots were no better than those for a control group that used typical internet sources for the task.

The findings highlight the risk of relying too heavily on patients to engage with LLMs and correctly interpret the information they receive, no matter how powerful the technology.

What the study showed

Researchers worked with human physicians to create ten healthcare scenarios with patient characteristics and symptoms, and ensured that the physicians unanimously agreed on the most appropriate course of action for the fictional patient at the center of each scenario.

The team then tested three LLMs (GPT-4o, Llama 3, Command R+) on the scenarios, asking the models to first identify the condition based on the data provided.

Tested on their own, the models correctly identified the conditions in 94.9% of cases. When asked to suggest a medically appropriate course of action (disposition), the models succeeded 56.3% of the time, on average.

Next, the researchers asked 1,298 participants to complete the same tasks. Participants were randomly assigned either to use one of the three models or, as a control group, to conduct research through whatever other means they would typically use.

The outcomes differed dramatically. LLM users identified relevant conditions in fewer than 34.5% of cases, and uncovered the correct course of action in fewer than 44.2%, neither of which was statistically different from the control group.

In fact, users in the control group had 1.76 times higher odds of identifying a relevant condition, and were 1.52 times more likely to identify conditions from the more serious “red flag” list, the authors said. In addition, participants using LLMs tended to underestimate the acuity of their conditions.
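“Higher odds” is easy to misread as “higher probability,” so a rough worked example may help. The sketch below converts a success rate to odds, applies the reported 1.76 ratio, and converts back. It assumes a baseline of 34.5%, which is only the upper bound the study gave for LLM users, and the function names are invented for illustration. The study’s own ratio likely comes from a statistical model, so the result is a back-of-the-envelope figure, not a number from the paper.

```python
def probability_to_odds(p: float) -> float:
    """Convert a probability (0 <= p < 1) to odds, i.e. p / (1 - p)."""
    if not 0.0 <= p < 1.0:
        raise ValueError("probability must be in [0, 1)")
    return p / (1.0 - p)


def odds_to_probability(odds: float) -> float:
    """Convert non-negative odds back to a probability."""
    if odds < 0.0:
        raise ValueError("odds must be non-negative")
    return odds / (1.0 + odds)


def apply_odds_ratio(baseline_p: float, odds_ratio: float) -> float:
    """Return the probability implied by scaling the baseline odds."""
    return odds_to_probability(probability_to_odds(baseline_p) * odds_ratio)


# ASSUMPTION: 34.5% is the reported upper bound for LLM users, used here
# as if it were the exact rate. The real rate was lower.
ASSUMED_LLM_RATE = 0.345
# Odds ratio for the control group, as quoted in the article.
REPORTED_ODDS_RATIO = 1.76

if __name__ == "__main__":
    implied = apply_odds_ratio(ASSUMED_LLM_RATE, REPORTED_ODDS_RATIO)
    print(f"Implied control-group rate: {implied:.1%}")
```

Under that assumption, 1.76 times the odds works out to a success rate of roughly 48%, not 1.76 times 34.5%. The gap between the two readings grows as the baseline rate rises.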

Humans being humans lowers the accuracy and utility of AI chatbots

Why the disconnect? Because ordinary people are…well, just people. They forget things, add quirky details, and misinterpret results.

When the team dug deeper into the conversations, they found that human users didn’t always communicate all the necessary information from the provided scenario to the model, which altered the way the AI chatbots made their differential diagnoses.

And while the LLMs suggested an average of 2.21 possible conditions per scenario (only 34% of which were correct), the humans listed an average of only 1.33 afterwards, which suggests they either didn’t remember certain conditions or decided to discount them on their own.

Users also added new information partway through the conversation with the chatbot that sometimes altered its output.

However, in some cases, the LLMs were to blame. The models either provided incorrect information or fixated on terms that were not relevant to the task at hand, skewing the conversation in an unwanted direction. Users were only sometimes successful at redirecting the chat and eventually extracting the desired information.

How to view healthcare AI chatbots in light of this study

The study should serve as an important reminder for patients (and their care providers) that access to LLMs doesn’t automatically turn average people into master diagnosticians. A new “health” label on their favorite chatbot doesn’t mean that the underlying technology is any better at guiding them in the right direction.

Humans are still fallible, and no two minds think exactly alike, which means no two conversations with a chatbot will ever be exactly the same. Those without clinical training still need a significant amount of education and support to understand how to interact with these tools, interpret the results, and take action when necessary.

Providers should focus on helping patients build critical thinking skills while emphasizing the importance of trusting their bodies when something really feels wrong. Making experienced clinicians available for triage and consults during off-hours can help to ensure patients always have a trusted human available when questions arise.

Somehow, AI chatbots will have to slot into these larger, ongoing efforts to equip patients with the knowledge they need to access care in a timely and appropriate manner. By combining the real utility of LLMs with the right education for lay users, the healthcare community can help individuals extract better results from AI chatbots and achieve better outcomes now and in the future.

Jennifer Bresnick is a journalist and freelance content creator with a decade of experience in the health IT industry. Her work has focused on leveraging innovative technology tools to create value, improve health equity, and achieve the promises of the learning health system. She can be reached at [email protected].