AI vs. physicians: Harvard research finds LLM wins on multiple clinical tasks

The study authors argue their new research indicates that AI models need to be tested in prospective trials.

Jun 8, 2026, 9:39 AM

AI models have already proven that they can pass the United States Medical Licensing Examination (USMLE) with flying colors, but how do they fare in clinical practice? A new study published in Science, one among the flood of research evaluating AI’s capabilities in healthcare, set out to compare the clinical diagnostic reasoning capabilities of a large language model (LLM) to those of human physicians. The LLM outperformed physicians and older models in several ways.

“In no way are we advocating that AI is ready to replace physicians,” said Peter Brodeur, MD, co-author and a Harvard Medical School clinical fellow in medicine at Beth Israel Deaconess Medical Center. “We now have the feasibility data to go ahead and look at these things prospectively…how they interact with physicians.”

DHI spoke to Brodeur about the study and how future research could test AI in clinical settings to build the foundation for safe integration of AI into clinical workflows.

OpenAI o1 vs. physicians

The study compares the OpenAI o1 series, a group of AI models designed to have reasoning capabilities, to prior, non-reasoning models and human physician baselines. Through a series of experiments, the researchers analyzed the LLM’s ability to reason through challenging diagnostic cases, to provide clinical documentation and to work through clinical management planning.

Five of the experiments drew on published patient vignettes and prior studies. The LLM put to the test outstripped both prior models and the human physician baselines. But physicians do not practice medicine in a controlled environment with curated data.

The sixth experiment was designed to test how the LLM would fare in a clinical setting without curated data. Researchers wanted to understand how the LLM “performed with the randomness of the real world,” according to Brodeur.

Researchers pulled data from the electronic health records of patients admitted to the Beth Israel Deaconess Medical Center emergency department and gave the model information from three touchpoints: the time of triage, the end of the emergency medicine encounter and after the patients were admitted to the hospital, Brodeur explained. How would the model perform compared to human physicians at each of these points?

“It turns out it performs quite well at both the triage level and at the end of the ED encounter. And then as more information becomes available, we see converging performance against a human baseline,” Brodeur said.

Testing AI in clinical settings

The results of this study, its authors argue, suggest that AI tools need to undergo prospective trials to better understand how they can improve clinical care.

“We have a bunch of models that are super, super capable. It’s time to put them into the hands of a physician so that we can see whether physicians are making better decisions with them than with conventional resources,” said Brodeur. “We need to start doing those studies in a more robust way to figure out how exactly we are going to deploy this in clinical practice when that day comes.”

Benchmarks are an important element of evaluating AI models’ capabilities. But the benchmarks currently in use are saturated, according to Brodeur. Many models are scoring at or near 100% on performance evaluations.

“Models are performing so high that there’s no more room to capture how it is they’re improving,” Brodeur said. “We need to create more rigorous benchmarks for a lot of these models.”

As benchmarks evolve, future research needs to explore many questions about the efficacy and safety of AI tools in the hands of physicians and patients. This research from Harvard leveraged general medicine and emergency department cases. Physicians in different specialties – Brodeur offered radiology and surgery as two examples – will use AI tools in different ways.

“One of the big takeaways is to be proactive about trying to study these models. Listen to your physicians: How are they using it?” Brodeur said. “We all care about different things in different parts of our practice. We should all have some stake in studying how exactly it is that our physicians want and like to use these tools.”

Carrie Pallardy, a Chicago-based freelance writer and editor, began her career covering healthcare more than a decade ago. Her work has taken her into many different industries, but covering healthcare delivery remains a constant focus. She can be reached at [email protected] or on LinkedIn.