Decades-old diagnostic system still has the edge on GenAI chatbots

DXplain, a medical AI system from the 1980s, beat ChatGPT and Gemini at diagnosing diseases, suggesting a hybrid approach may be best for healthcare.

By admin

Jul 8, 2025, 11:00 AM

DXplain, a rule-based diagnostic system originally developed at Massachusetts General Hospital (MGH) in the 1980s, has outperformed modern generative AI models in diagnosing patient cases, according to a new study published in JAMA Network Open.

The study, carried out by researchers at the Mass General Brigham system’s Laboratory of Computer Science, compared three diagnostic decision-support systems (DDSS): DXplain, and two large language models (LLMs), ChatGPT and Gemini. The systems were tested across 36 patient cases that varied by race, age, gender, and symptoms, both with and without laboratory results.

The 40-year-old system that beat ChatGPT

DXplain, built on 40 years of medical knowledge, diagnosed correctly more often than both modern AI systems. With lab data included, DXplain correctly diagnosed 72 percent of cases, compared to 64 percent for ChatGPT and 58 percent for Gemini. Without lab data, DXplain correctly diagnosed 56 percent of cases, higher than ChatGPT’s 42 percent and Gemini’s 39 percent, though the difference wasn’t statistically significant.
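The article reports percentages rather than case counts, so it can help to see what those percentages imply for a 36-case test set. The short Python sketch below assumes each figure is a rounded share of the same 36 cases; the counts it prints are inferred from that assumption and are not taken from the study itself.

```python
# Assumption: every reported percentage is a rounded share of the same 36 cases.
TOTAL_CASES = 36

# Percentages as reported in the article, keyed by (system, lab data included?).
REPORTED_PERCENT = {
    ("DXplain", True): 72,
    ("ChatGPT", True): 64,
    ("Gemini", True): 58,
    ("DXplain", False): 56,
    ("ChatGPT", False): 42,
    ("Gemini", False): 39,
}


def implied_count(percent: int, total: int = TOTAL_CASES) -> int:
    """Return the whole number of cases closest to `percent` of `total`."""
    return round(percent / 100 * total)


if __name__ == "__main__":
    for (system, with_labs), percent in REPORTED_PERCENT.items():
        label = "with labs" if with_labs else "without labs"
        count = implied_count(percent)
        print(f"{system} ({label}): about {count} of {TOTAL_CASES} cases")
```

Under this assumption, each gap between systems amounts to only a handful of cases, which is a useful reminder of how small the test set is.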

DXplain’s origins trace back to 1984, when researchers at the Massachusetts General Hospital Laboratory of Computer Science started building what would become one of the first expert systems in medicine. The system relies on a large database of symptoms, diseases, and programmed medical logic, allowing physicians to input patient information to generate ranked diagnostic possibilities, along with explanations and follow-up recommendations.

In its early days, DXplain was among the few computerized aids available to clinicians. Newer AI systems like ChatGPT have learned to mimic human conversation and summarize large bodies of information, but they often can’t explain their reasoning or show their medical logic. Older systems like DXplain, by contrast, are built to show their reasoning and to avoid common diagnostic pitfalls through structured clinical logic.

Different approaches, different strengths

AI chatbots have quickly become popular in hospitals for tasks like digesting clinician notes, drafting radiology reports, and providing second opinions. Trained on massive datasets and relying on pattern recognition, these systems can sometimes outperform humans at specific tasks. But there’s a problem. Unlike rule-based systems, LLMs are trained on general internet content and may hallucinate facts, misinterpret labs, or offer terse medical advice without supporting evidence. These errors become more likely with atypical cases.

Despite DXplain’s better overall results, researchers said the answer isn’t to use DXplain alone but to combine it with newer AI tools. They found that DXplain and the two LLMs had different strengths, with each system identifying correct diagnoses that the others missed.

“Now, we think combining the powerful explanatory capabilities of existing diagnostic systems with the linguistic capabilities of large language models will enable better automated diagnostic decision support and patient outcomes,” said Mitchell Feldman, MD, co-author of the study, in the Mass General Brigham press release.

Researchers are testing a new method: use LLMs to extract findings from unstructured clinical data, such as narrative notes, symptoms, and histories, then feed that information into a rigorously structured diagnostic system like DXplain.
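To make that two-stage idea concrete, here is a minimal Python sketch of such a pipeline. Everything in it is hypothetical: the function names, the tiny knowledge base, and the placeholder condition names are invented for illustration, and simple keyword matching stands in for the LLM. It does not reflect DXplain’s actual logic or interface.

```python
from dataclasses import dataclass

# Toy knowledge base mapping each condition to the findings that support it.
# Purely illustrative; not drawn from DXplain or any clinical source.
TOY_KNOWLEDGE_BASE = {
    "condition_a": {"fever", "cough"},
    "condition_b": {"fever", "rash"},
}

# Hypothetical controlled vocabulary the structured system understands.
KNOWN_FINDINGS = {"fever", "cough", "rash"}


@dataclass
class RankedDiagnosis:
    name: str
    matched: list  # findings that explain why this condition was ranked
    score: float   # fraction of the condition's findings that were present


def extract_findings(note: str) -> set:
    """Stage 1 (stand-in for the LLM): turn free text into structured findings.

    A real system would call a language model here; keyword matching is
    used only to keep the sketch self-contained and runnable.
    """
    text = note.lower()
    return {finding for finding in KNOWN_FINDINGS if finding in text}


def rank_diagnoses(findings: set) -> list:
    """Stage 2 (stand-in for the rule-based system): rank conditions.

    Each condition is scored by the share of its expected findings that
    appear in the input, and the matched findings are kept as the explanation.
    """
    ranked = []
    for condition, expected in TOY_KNOWLEDGE_BASE.items():
        matched = sorted(findings & expected)
        if matched:
            ranked.append(
                RankedDiagnosis(condition, matched, len(matched) / len(expected))
            )
    # Highest score first; ties broken alphabetically for stable output.
    return sorted(ranked, key=lambda d: (-d.score, d.name))


def diagnose(note: str) -> list:
    """Run both stages: extract findings, then rank conditions."""
    return rank_diagnoses(extract_findings(note))


if __name__ == "__main__":
    for d in diagnose("Patient reports fever and a dry cough for three days."):
        print(d.name, d.matched, d.score)
```

The point of the split is that the second stage only ever sees structured findings, so its ranking can be traced back to the specific findings that matched, whatever tool produced them.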

Newer may not be better

Just a few years ago, “AI” referred to rule-based alerts triggered in electronic health records. Today, the term most often conjures images of LLM-enhanced drafting of discharge summaries or chatbot triage. Without proper clinical testing, these innovations may not deliver on their promises.

The results of the study back up earlier research showing that combined systems, which pair neural network agility with logical consistency, work better than either system alone. This could change how hospitals use AI. In preliminary efforts underway at MGH, LLMs are being trained to pull clinical findings from narrative text, organize them into structured formats, and then pass them to a DDSS like DXplain. If successful, clinicians could benefit from both the richness of generative AI and the transparent logic of expert systems.

Dr. Feldman’s team sees this combination as the right way forward. “Amid all the interest in large language models, it’s easy to forget that the first AI systems used successfully in medicine were expert systems like DXplain,” said Dr. Edward Hoffer, another of the study’s co-authors.

It won’t be easy, but it will be worth it

Getting new AI tools to work with older systems won’t be easy. Extracting reliable clinical findings from narrative text is complex, especially given the variability in documentation styles. Lab result integration, version control, regulatory approval, and randomized controlled trials all lie ahead.

DXplain and similar systems, while clear in their reasoning, require continual manual curation. Their disease databases may lag behind the latest research, novel therapies, or emerging pathogens, areas where generative AI excels. Even so, DXplain’s win shows that well-engineered systems, designed for specific jobs and improved over decades, can beat newer tools.

But Drs. Feldman and Hoffer say healthcare AI shouldn’t swing from one extreme to another; it should build on existing strengths. Rule-based systems provide accuracy, and conversational AI can interpret unclear information. Together, they can help clinicians diagnose patients faster and more accurately. As hospitals rush to adopt new AI tools, this study offers both a warning and encouragement. New AI technology works best when hospitals use it alongside their existing tools as part of doctor-led efforts to treat patients.