AI scribes are everywhere in healthcare, but evaluating them is messy

Researchers have developed a framework to test AI scribing tools as hospitals struggle to validate performance beyond user satisfaction surveys.
By admin
Jul 25, 2025, 12:32 PM

Ambient digital scribing (ADS) tools have become the poster child for artificial intelligence in healthcare, promising to free doctors from the documentation that keeps them typing late into the night. But as these AI-powered scribes proliferate across hospitals and clinics, no one really knows how to properly evaluate whether they work.

A new study published in Nature tackles this challenge by proposing an evaluation framework for ADS tools. The research exposes significant gaps in how healthcare organizations have been assessing these ubiquitous tools.

“Current evaluation strategies for ADS tools remain insufficient, primarily emphasizing user satisfaction through surveys and performance assessments relying on expert evaluations,” note the study’s authors. “These evaluations are either solely based on human-driven qualitative evaluation, or automated evaluation like ROUGE, Word Error Rate (WER), and F1 scores, which are not fully tailored to the complexities of clinical workflows.”
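To see why a metric like WER can miss what matters clinically, consider a minimal sketch (ours, not the study's): a single substituted word that reverses a medication instruction barely moves the score.

```python
# Bare-bones Word Error Rate (WER): word-level edit distance divided by the
# reference length. Real pipelines typically use a library such as jiwer;
# this sketch only shows what the metric does and doesn't capture.

def word_error_rate(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edits to turn the first i reference words into the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution
    return dp[-1][-1] / max(len(ref), 1)

# A clinically serious substitution scores the same as a harmless one:
print(word_error_rate("start metformin 500 mg twice daily",
                      "stop metformin 500 mg twice daily"))   # ~0.17, yet the meaning is reversed
```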

The concern over ADS tool evaluation is not just academic. The problem directly affects patient care and healthcare economics, with the global market for AI in healthcare projected to exceed $500 billion by 2032.

Satisfaction surveys — missing the obvious

The study draws on data from earlier research to note that clinicians spend an average of nearly 2 hours every day on EHR tasks, much of it during “pajama time” after hours. Additional studies have shown that 78% of clinicians report improved efficiency with AI scribes like DAX Copilot, and 21% of physicians now use AI for documentation of billing codes, medical charts, or visit notes, up from 13% in 2023.

The researchers, who developed their own AI scribing tool for testing purposes, found that ambient AI systems struggle in ways that user surveys might miss entirely.

When researchers tested their system with artificially corrupted data to simulate the kind of transcription errors that occur in real clinical settings, performance declined across all metrics. More troubling, when they fed the system unrealistic lab values, such as negative potassium levels, the AI retained these impossible results in 60% of cases, sometimes “correcting” them without explanation.
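The kind of plausibility checking this implies is not complicated. The sketch below is our illustration, not the study's code, and the ranges and names are placeholders; it flags lab values in text that fall outside crude physiological bounds, the sort of guardrail that would catch a negative potassium slipping through.

```python
# Illustration only (not the study's code): flag lab values in text that fall
# outside crude plausibility bounds. Ranges and names are placeholders.
import re

PLAUSIBLE_RANGES = {
    "potassium": (1.5, 9.0),   # mmol/L; a negative value is physiologically impossible
    "sodium": (110.0, 170.0),  # mmol/L
}

def implausible_values(text: str) -> list[str]:
    """Return lab mentions whose values fall outside the crude bounds above."""
    findings = []
    for analyte, (low, high) in PLAUSIBLE_RANGES.items():
        for match in re.finditer(rf"{analyte}\s+(-?\d+(?:\.\d+)?)", text, re.IGNORECASE):
            value = float(match.group(1))
            if not low <= value <= high:
                findings.append(f"{analyte}={value} outside [{low}, {high}]")
    return findings

corrupted_transcript = "Labs today show potassium -3.2 and sodium 139."
generated_note = "Assessment: hypokalemia with potassium -3.2 mmol/L."  # scribe kept the impossible value

print(implausible_values(corrupted_transcript))  # ['potassium=-3.2 outside [1.5, 9.0]']
print(implausible_values(generated_note))        # same flag: the error survived summarization
```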

“These findings highlight significant gaps in the system’s ability to address adversarial or nonsensical data inputs,” the researchers wrote.

SCRIBE framework unites four different approaches

The study introduces a new framework, SCRIBE (Simulation, Computational metrics, Reviewer assessment, and Intelligent Evaluations for Best practice) as a more rigorous way to test ADS tools. Unlike current approaches that rely heavily on human evaluation or basic automated metrics, SCRIBE combines multiple assessment methods to capture different aspects of performance.

The framework recognizes that “no single method captures all performance dimensions.” Human reviewers catch clinical nuances but are subjective and don’t scale. Automated metrics provide objective benchmarks but miss contextual understanding. Large language models (LLMs) as evaluators blend human-like reasoning with machine consistency, while simulation testing enables stress-testing scenarios that wouldn’t occur naturally.
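As a rough sketch of the LLM-as-evaluator idea, one can prompt a model to score each note against a rubric and return structured scores. The rubric dimensions below echo those named elsewhere in the article (clarity, completeness, relevance); the prompt wording is ours, and `call_llm` stands in for whatever model API an organization actually uses.

```python
# Sketch of an LLM-as-judge evaluator. `call_llm` is any function that maps a
# prompt string to the model's text reply; the rubric and prompt wording are
# illustrative, not taken from the study.
import json

RUBRIC = ["clarity", "completeness", "relevance"]

def build_judge_prompt(transcript: str, note: str) -> str:
    return (
        "You are evaluating an AI-generated clinical note against its source transcript.\n"
        f"Score each dimension from 1 (poor) to 5 (excellent): {', '.join(RUBRIC)}.\n"
        'Return JSON like {"clarity": 4, "completeness": 3, "relevance": 5}.\n\n'
        f"TRANSCRIPT:\n{transcript}\n\nNOTE:\n{note}\n"
    )

def judge_note(transcript: str, note: str, call_llm) -> dict:
    """Ask the model to grade the note and parse its JSON reply."""
    reply = call_llm(build_judge_prompt(transcript, note))
    scores = json.loads(reply)
    return {dim: scores.get(dim) for dim in RUBRIC}
```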

This multi-pronged approach revealed telling patterns. When human evaluators and automated systems assessed the same AI-generated notes, they often disagreed. The researchers found that automated evaluation showed “weak correlation with human evaluation” with “correlation coefficients falling below 0.2.” Even among human evaluators, agreement was modest at 53.8%.
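Both comparisons, automated-versus-human correlation and reviewer-versus-reviewer agreement, are easy to compute once scores are collected. A minimal sketch with fabricated numbers:

```python
# Fabricated numbers, purely to show the two calculations the study reports.
from statistics import mean

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    denom = (sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys)) ** 0.5
    return cov / denom

def percent_agreement(rater_a, rater_b):
    """Fraction of items on which two reviewers gave the same rating."""
    return sum(a == b for a, b in zip(rater_a, rater_b)) / len(rater_a)

automated = [0.71, 0.83, 0.64, 0.90, 0.77]    # e.g. ROUGE-style scores (fabricated)
human     = [4, 3, 3, 4, 5]                   # reviewer ratings on a 1-5 scale (fabricated)
print(round(pearson_r(automated, human), 2))  # 0.21: decent-looking metrics, weak relationship

rater_a = [4, 3, 5, 4, 2]
rater_b = [4, 4, 5, 3, 2]
print(percent_agreement(rater_a, rater_b))    # 0.6
```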

Embedded bias can’t be overlooked

Perhaps most concerning were the framework’s findings about bias and fairness. When researchers simulated patient encounters with different demographic profiles, they discovered statistically significant differences in how the AI system performed across racial groups.

Transcripts labeled as coming from Black patients showed different “toxicity scores” compared to those from white or Asian patients, though the researchers noted they couldn’t identify the specific cause of these disparities through manual review. This finding highlights a blind spot in current AI scribe deployment: most health systems aren’t systematically testing for bias.
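Checking for this kind of disparity does not require exotic tooling. The sketch below uses fabricated scores and a simple permutation test to compare mean toxicity between two groups, the sort of check a health system could fold into routine monitoring.

```python
# Fabricated toxicity scores for two demographic groups, compared with a
# simple permutation test. Illustration only; not the study's data or method.
import random

def permutation_test(group_a, group_b, n_permutations=10_000, seed=0):
    """Two-sided p-value for the observed difference in group means."""
    rng = random.Random(seed)
    observed = abs(sum(group_a) / len(group_a) - sum(group_b) / len(group_b))
    pooled = list(group_a) + list(group_b)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        a, b = pooled[:len(group_a)], pooled[len(group_a):]
        if abs(sum(a) / len(a) - sum(b) / len(b)) >= observed:
            extreme += 1
    return extreme / n_permutations

toxicity_group_1 = [0.02, 0.03, 0.05, 0.04, 0.06, 0.03]
toxicity_group_2 = [0.07, 0.09, 0.08, 0.10, 0.07, 0.09]
print(permutation_test(toxicity_group_1, toxicity_group_2))  # small p-value flags a gap worth investigating
```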

“These findings suggest that further analysis is needed to determine the root causes and significance of the observed disparities in toxicity scores,” the researchers cautioned.

As AI scribes become standard in healthcare, embedded biases could systematically affect how patient encounters are documented, potentially influencing treatment decisions and care quality across different populations.

A governance overhaul

While AI scribes have achieved widespread adoption faster than almost any other healthcare AI application, the infrastructure for ongoing monitoring and quality assurance hasn’t kept pace.

The researchers aligned their framework with the Coalition for Health AI (CHAI) governance model, proposing different evaluation strategies for different deployment phases. During development, they recommend emphasizing human and automated evaluation. Pre-deployment should focus on simulation testing. During “silent deployment,” when systems run alongside but don’t replace current processes, doctors should serve as evaluators.

This staged approach acknowledges that AI scribes aren’t static tools that can be validated once and forgotten. They’re learning systems that may drift or encounter new scenarios over time, requiring continuous monitoring.
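One way to operationalize that staged approach is to encode the phase-to-method mapping as configuration that a monitoring pipeline reads. The sketch below mirrors the phases described above; the structure and names are ours, not CHAI's or the study's.

```python
# Hypothetical configuration mapping deployment phases to evaluation methods,
# mirroring the staged approach described above. Names and structure are ours.
EVALUATION_PLAN = {
    "development":       ["human_review", "automated_metrics"],
    "pre_deployment":    ["simulation_testing"],
    "silent_deployment": ["clinician_review"],  # system runs alongside, not instead of, current workflow
    "post_deployment":   ["drift_monitoring", "bias_audits", "automated_metrics"],
}

def methods_for(phase: str) -> list[str]:
    return EVALUATION_PLAN.get(phase, [])

print(methods_for("silent_deployment"))  # ['clinician_review']
```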

Stronger evaluation for better implementation

The findings have immediate practical implications for the hundreds of health systems currently implementing AI scribes, suggesting that user satisfaction, while important to track, is insufficient for understanding whether these tools are improving care or introducing new risks.

For example, the research found that AI scribes struggled particularly with newer medications, making transcription errors that led to drugs being omitted entirely from clinical notes. Performance on rare diseases was better but still imperfect, as demonstrated when Smith-Lemli-Opitz Syndrome was transcribed as “smith only opus.”
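Simple tooling can help surface such garbled terms before a clinician signs the note. The sketch below is illustrative (the term list and cutoff are placeholders): it fuzzy-matches a suspect phrase against a lexicon of known disease and drug names using the standard library.

```python
# Illustration only: fuzzy-match a suspect phrase against known clinical terms
# with the standard library. The lexicon and cutoff are placeholders.
import difflib

KNOWN_TERMS = ["Smith-Lemli-Opitz Syndrome", "metformin", "semaglutide"]

def suggest_correction(phrase: str, cutoff: float = 0.4):
    """Return the known term most similar to the phrase, if any clears the cutoff."""
    lowered = {term.lower(): term for term in KNOWN_TERMS}
    hits = difflib.get_close_matches(phrase.lower(), list(lowered), n=1, cutoff=cutoff)
    return lowered[hits[0]] if hits else None

print(suggest_correction("smith only opus"))  # 'Smith-Lemli-Opitz Syndrome' (similarity ~0.44)
```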

The study underscores the need for healthcare IT leaders to ask the hard questions: Are these systems accurately capturing medication changes? How do they handle rare diseases or new drug names? Do they perform equally well across different patient populations?

Despite the challenges, the study’s authors don’t advocate against the use of AI scribes. Their internally developed tool performed well across most metrics, showing strong scores for clarity, completeness, and relevance. GPT-based notes consistently outperformed those generated by LLaMA models, suggesting that the choice of underlying AI technology matters.

The key insight is that healthcare organizations need better ways to evaluate these tools beyond user surveys and vendor claims. The SCRIBE framework offers one approach, but the broader principle is clear: as AI becomes more prevalent in clinical workflows, evaluation methods must become more sophisticated.

“Clinicians should carefully review, edit, and approve all AI-generated notes before they are integrated into patient records,” the researchers emphasized. “This human-in-the-loop approach not only mitigates potential errors but also preserves accountability within medical practice.”

