OpenAI creates HealthBench, a tool to evaluate healthcare AI chat models
OpenAI, the developer of ChatGPT, has unveiled HealthBench, a tool to help evaluate the capabilities, safety, and utility of its artificial intelligence (AI) models when used to answer health-related queries. HealthBench was built in partnership with more than 250 physicians from around the world to ensure that responses to health-related queries, from clinical and non-clinical users alike, meet certain standards of accuracy and completeness.
“Improving human health will be one of the defining impacts of artificial general intelligence (AGI). If developed and deployed effectively, large language models (LLMs) have the potential to expand access to health information, support clinicians in delivering high-quality care, and help people advocate for their health and that of their communities,” said OpenAI in a blog post. “To get there, we need to ensure models are useful and safe. Evaluations are essential to understanding how models perform in health settings.”
HealthBench includes 5,000 simulated conversations designed to mimic realistic use of the conversational interfaces of large language models like ChatGPT. Some of the use cases were generated synthetically, while others were created by humans.
Sample interactions included a person asking what to do after finding their older neighbor unresponsive on the ground, a query about what ongoing headaches could mean, and a physician’s question about the next best steps to monitor a patient who recently began a new medication.
Physicians then evaluated the responses for clinically sound advice, important cautions and caveats, and clear, appropriate suggestions for follow-up actions. Each response was then given a score based on a rubric tailored to that interaction.
“Each criterion has a corresponding point value, weighted to match the physician’s judgment of that criterion’s importance,” the researchers explained. “HealthBench contains 48,562 unique rubric criteria, providing extensive coverage of specific facets of model performance. Model responses are evaluated by a model-based grader (GPT-4.1) to assess whether each rubric criterion is met, and responses receive an overall score based on the total score of criteria met compared to the maximum possible score.”
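The scoring scheme described above can be sketched in a few lines of code. This is a hypothetical illustration, not OpenAI's implementation: the criterion names and point values are invented, and the grading step (deciding which criteria a response meets) is assumed to have already been done by the model-based grader.

```python
def rubric_score(criteria, met):
    """Score a response against a rubric.

    criteria: dict mapping criterion name -> point value
    met: set of criterion names the grader judged as met

    Returns points earned divided by the maximum possible points,
    mirroring the percentage scores cited in the article.
    """
    max_points = sum(criteria.values())
    earned = sum(pts for name, pts in criteria.items() if name in met)
    return earned / max_points


# Invented example rubric for the "unresponsive neighbor" scenario:
criteria = {
    "advises_calling_emergency_services_first": 10,
    "explains_how_to_describe_situation_to_responders": 7,
    "includes_appropriate_cautions": 5,
}
met = {
    "advises_calling_emergency_services_first",
    "explains_how_to_describe_situation_to_responders",
}
print(f"{rubric_score(criteria, met):.0%}")  # 17 of 22 points earned
```

A response meeting high-value criteria (such as recommending emergency services up front) scores well even if it misses minor ones, which matches how the weighting described by the researchers is meant to work.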
One example, in which a user asked if a specific medication was effective for preventing viral infections, scored only 4% on its associated rubric. The response lacked scientific evidence for its assertions, failed to note that there is no established consensus on the effectiveness of the medication, and didn’t suggest considering potential drug interactions when adding the medication to a drug regimen.
On the other hand, the question about what to do with an unresponsive neighbor earned a score of 77% due to its clear and concise advice to call emergency services at the beginning of the response and details around how to communicate the nature of the situation to first responders.
In addition to evaluating how current models respond to common inputs, OpenAI’s researchers wanted to compare the models’ responses with the advice of physicians unaided by AI. They asked the physicians to write their own expert advice for each of the use cases or to edit and augment responses from AI models.
The team found that using models available in 2024, physician-written responses outperformed AI-written responses. However, once newer models were released in 2025, the human responses were no longer significantly better than those generated by AI.
“Our findings show that large language models have improved significantly over time and already outperform experts in writing responses to examples tested in our benchmark. Yet even the most advanced systems still have substantial room for improvement, particularly in seeking necessary context for underspecified queries and worst-case reliability. We look forward to sharing results for future models,” the researchers concluded.
Jennifer Bresnick is a journalist and freelance content creator with a decade of experience in the health IT industry. Her work has focused on leveraging innovative technology tools to create value, improve health equity, and achieve the promises of the learning health system. She can be reached at [email protected].