Objective and transparent evaluation of AI models in healthcare has never been more important.
The application of AI technology in healthcare is likely to be one of humankind’s most important and substantial contributions of the 21st century. The work in this arena has been monumental, with large language models now competitive with, and often able to out-compete, human physicians in reasoning, aptitude and breadth of knowledge. For example, Med-Gemini was found to be 91% accurate in early benchmark tests for the United States Medical Licensing Exam (USMLE). Early versions of ChatGPT were also found to meet the USMLE passing threshold.
The technology has since evolved far beyond passing written exam questions; healthcare and technology enthusiasts are now actively incorporating the latest reasoning models into actual clinical practice and patient care. This means that research, testing and objective benchmarking of the performance, safety and efficacy of these models should be of utmost importance.
This is the exact mission of the ARiSE network, established in 2024 as a collaboration between clinicians and research experts across academic and medical centers to develop, test and rigorously assess AI outputs and solutions in healthcare delivery. The mission for ARiSE is relatively straightforward: Dr. Ethan Goh, executive director of the collaborative, explains that the primary focus goes beyond model performance and technical build efficacy, which is what the majority of research in this arena currently entails. Instead, he passionately describes ARiSE as focused on “clinical reasoning, safety and explainability” as a means to answer a fundamental question: “Can AI safely reason like a doctor in real-world care?”
In this effort, ARiSE has produced groundbreaking, industry-moving work.
A powerful paper the group published earlier this year in Nature Medicine found that large language models can significantly enhance physician reasoning on complex clinical vignettes compared with conventional resources.
Another seminal paper by the group, published in the New England Journal of Medicine (NEJM) AI in August of this year, evaluated the role and efficacy of large language model agents using a purpose-built evaluation platform, MedAgentBench. The platform comprises 300 patient-specific tasks written by human physicians and was used to evaluate nearly all of the latest models in an agentic framework, measuring their success rates in completing the tasks. The study found that nearly all of the LLMs performed significantly better on query-based tasks than on action-based tasks, with Claude 3.5 Sonnet achieving the highest overall success rate of 69.67%. The authors concluded that there is a “need for improved LLM capabilities in navigating complex, action-based decision-making tasks.”
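To make that kind of scoring concrete, here is a minimal, hypothetical sketch of how success rates might be tallied across query-based and action-based tasks in a benchmark of this sort. The task records, field names and pass/fail logic below are illustrative assumptions for explanation only, not the actual MedAgentBench code or data.

```python
from collections import defaultdict

# Hypothetical task records: each task has a category ("query" = retrieve
# information, "action" = carry out a step such as placing an order) and a
# boolean indicating whether the agent completed it. Illustrative only.
results = [
    {"task_id": "task_001", "category": "query",  "success": True},
    {"task_id": "task_002", "category": "query",  "success": True},
    {"task_id": "task_003", "category": "action", "success": False},
    {"task_id": "task_004", "category": "action", "success": True},
    # ... the real benchmark contains 300 physician-written tasks
]

def success_rates(records):
    """Compute the overall and per-category success rates for one agent run."""
    totals, passes = defaultdict(int), defaultdict(int)
    for r in records:
        totals[r["category"]] += 1
        passes[r["category"]] += int(r["success"])
    per_category = {c: passes[c] / totals[c] for c in totals}
    overall = sum(passes.values()) / sum(totals.values())
    return overall, per_category

overall, by_category = success_rates(results)
print(f"Overall success rate: {overall:.2%}")
for category, rate in by_category.items():
    print(f"  {category}-based tasks: {rate:.2%}")
```

It is exactly this kind of per-category breakdown that exposed the gap the authors describe between querying for information and taking clinical action.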
A third paper by the group produced even more astounding results. It describes a study that compared LLMs and board-certified physicians across three clinical scenarios: emergency room triage, initial evaluation by a physician, and admission to the hospital or intensive care unit. The researchers found that in all three scenarios, the LLM clearly outperformed the physicians, displaying superhuman diagnostic and reasoning abilities.
Certainly, ARiSE is not the only player in the field pursuing benchmark testing and evaluation of AI models. OpenAI announced its HealthBench benchmark earlier this year to better test its own tools and model efficacy in complex clinical scenarios. HealthBench was built with 262 physicians across 60 countries and includes 5,000 health conversations, “each with a custom physician-created rubric to grade model responses.” Another example is Google’s global health LLM benchmarking effort, which entails “11,000+ manually and LLM-generated personas representing a broad array of tropical and infectious diseases across demographic, contextual, location, language, clinical, and consumer augmentations.” Its goal is to understand LLMs’ performance across “tasks involving distribution shifts in disease types, region-specific medical knowledge, and contextual variations across symptoms, language, location, linguistic diversity, and localized cultural contexts.” As quickly as the field of model evaluation is growing, what distinguishes ARiSE is that it has no intention of being commercial in nature; rather, it intends to stay independent and objective.
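For a sense of how the per-conversation rubrics mentioned above work in principle, here is a simplified, hypothetical sketch: each conversation carries physician-written criteria with point weights, and a response’s score is the fraction of achievable points it earns. The data structures and example criteria are assumptions for illustration, not OpenAI’s actual HealthBench schema or grading code.

```python
from dataclasses import dataclass

# Hypothetical rubric item: a physician-written criterion with a point weight
# and a flag for whether a grader judged the response to satisfy it.
@dataclass
class Criterion:
    description: str
    points: int
    met: bool

def rubric_score(criteria: list[Criterion]) -> float:
    """Return the fraction of achievable points the model response earned."""
    earned = sum(c.points for c in criteria if c.met)
    possible = sum(c.points for c in criteria if c.points > 0)
    return earned / possible if possible else 0.0

# Example: grading one model response against a three-item rubric.
rubric = [
    Criterion("Advises seeking emergency care for red-flag symptoms", 5, True),
    Criterion("Asks a clarifying question about symptom duration", 3, False),
    Criterion("Avoids giving a definitive diagnosis without an exam", 2, True),
]
print(f"Rubric score: {rubric_score(rubric):.0%}")  # 70%
```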
Dr. Adam Rodman, a physician and one of the key research leaders at ARiSE, reiterates the group’s commitment: to push model development in a way that is meaningful for the field by designing impactful benchmarks and tests that reveal what these systems will actually do in the real world.
Why is all of this so important?
The world is no longer in a place where LLMs and AI applications are used simply for written exams or theoretical reference. Rather, the field has reached something of a critical mass: models are now being actively deployed in actual clinical scenarios that are often fraught with critical patient care decisions.
An article via HuggingFace describes the value of evaluation exercises as well as the purpose behind its own model leaderboard initiative: “When models are used for recreational conversational aspects, errors have little repercussions; this is not the case for uses in the medical domain however, where wrong explanations and answers can have severe consequences for patient care and outcomes.”
Take, for example, the plethora of ambient scribing companies that have rapidly become a mainstay in many practices; an article published in the American Academy of Family Physicians’ (AAFP) journal reported that “initial results from AI integrated organizations show that AI scribe programs significantly decrease electronic health record workload, and as physician burnout rates remain high, there is greater interest in AI tools.” Studies similarly indicate that AI-powered clinical decision support (CDS) tools are increasingly being used; in fact, the market is expected to reach $1.79 billion by 2030, growing at a CAGR of 15.6%.
Indeed, demand for AI in actual clinical settings is skyrocketing, driven by pervasive physician burnout, rising costs amid a labor shortage, and a growing population with chronic health conditions. Healthcare is at a breaking point, and the industry is turning to technology as a key saving grace. Therefore, objective analytics, transparent benchmarks and impartial studies can no longer be treated as optional; they must be viewed as a cornerstone of innovation.