{"id":197372,"date":"2025-10-08T11:19:20","date_gmt":"2025-10-08T11:19:20","guid":{"rendered":"https:\/\/www.newsbeep.com\/ca\/197372\/"},"modified":"2025-10-08T11:19:20","modified_gmt":"2025-10-08T11:19:20","slug":"a-stanford-and-harvard-backed-lab-dedicated-to-objectively-validating-ai-in-healthcare","status":"publish","type":"post","link":"https:\/\/www.newsbeep.com\/ca\/197372\/","title":{"rendered":"A Stanford And Harvard Backed Lab Dedicated To Objectively Validating AI in Healthcare"},"content":{"rendered":"<p><img decoding=\"async\" class=\" top-image\" src=\"https:\/\/specials-images.forbesimg.com\/imageserve\/68e45b77ca6427805b573bac\/Objective-and-transparent-evaluation-of-AI-models-in-healthcare-has-never-been-more\/960x0.jpg?fit=scale\" alt=\"Objective and transparent evaluation of AI models in healthcare has never been more important.\" data-height=\"2427\" data-width=\"4315\" fetchpriority=\"high\" style=\"position:absolute;top:0\"\/><\/p>\n<p>Objective and transparent evaluation of AI models in healthcare has never been more important.<\/p>\n<p>getty<\/p>\n<p>The application of AI technology in healthcare is likely one of humankind\u2019s most important and substantial contributions of the 21st century. The work in this arena has been monumental, with large language models that are now competitive with (and can often outcompete) human physicians in reasoning, aptitude and breadth of knowledge. For example, <a href=\"https:\/\/research.google\/blog\/advancing-medical-ai-with-med-gemini\/\" target=\"_blank\" rel=\"nofollow noopener noreferrer\" data-ga-track=\"ExternalLink:https:\/\/research.google\/blog\/advancing-medical-ai-with-med-gemini\/\" aria-label=\"Med-Gemini\">Med-Gemini<\/a> was found to be 91% accurate in early benchmark tests for the United States Medical Licensing Examination (USMLE). 
Early versions of <a href=\"https:\/\/journals.plos.org\/digitalhealth\/article?id=10.1371\/journal.pdig.0000198\" target=\"_blank\" rel=\"nofollow noopener noreferrer\" data-ga-track=\"ExternalLink:https:\/\/journals.plos.org\/digitalhealth\/article?id=10.1371\/journal.pdig.0000198\" aria-label=\"ChatGPT\">ChatGPT<\/a> were found to reach the passing threshold for the USMLE as well.<\/p>\n<p>Nevertheless, the technology has since evolved far beyond answering written exam questions; healthcare and technology enthusiasts are now actively incorporating the latest reasoning models into actual clinical practice and patient care modalities. This means that research, testing and objective benchmarking of the performance, safety and efficacy of these models should be of utmost importance. <\/p>\n<p>This is precisely the mission of the <a class=\"color-link\" href=\"https:\/\/www.arise-ai.org\/about\" target=\"_blank\" rel=\"nofollow noopener noreferrer\" data-ga-track=\"ExternalLink:https:\/\/www.arise-ai.org\/about\" aria-label=\"ARisE\">ARiSE<\/a> network, established in 2024 as a collaborative unit between clinicians and deep research experts across academic and medical centers to develop, test and rigorously assess AI outputs and solutions in healthcare delivery. The mission for ARiSE is relatively straightforward; Dr. Ethan Goh, executive director of the collaborative, explains that the primary focus goes beyond just model performance and technical build efficacy, which is what the majority of research in this arena currently entails. Instead, he passionately describes ARiSE as focused on \u201cclinical reasoning, safety and explainability\u201d as a means to answer a fundamental question: \u201cCan AI safely reason like a doctor in real-world care?\u201d <\/p>\n<p>In this effort, ARiSE has been able to produce groundbreaking, industry-moving work. 
<\/p>\n<p>An incredibly powerful paper the group published earlier this year in <a href=\"https:\/\/www.nature.com\/articles\/s41591-024-03456-y\" target=\"_blank\" rel=\"nofollow noopener noreferrer\" data-ga-track=\"ExternalLink:https:\/\/www.nature.com\/articles\/s41591-024-03456-y\" aria-label=\"Nature Medicine\">Nature Medicine<\/a> highlighted that large language models can significantly assist physician reasoning in complex clinical vignettes when compared to conventional resources. <\/p>\n<p>Another <a href=\"https:\/\/ai.nejm.org\/doi\/10.1056\/AIdbp2500144\" target=\"_blank\" rel=\"nofollow noopener noreferrer\" data-ga-track=\"ExternalLink:https:\/\/ai.nejm.org\/doi\/10.1056\/AIdbp2500144\" aria-label=\"seminal paper\">seminal paper<\/a> by the group, published in the New England Journal of Medicine (NEJM) AI in August of this year, evaluated the role and efficacy of large language model agents using a proprietary evaluation platform, MedAgentBench. The platform comprises 300 patient-specific scenarios\/tasks written by human physicians and was used to evaluate nearly all of the latest models in an agentic framework to determine their success rates in completing the tasks. The study found that nearly all of the LLMs performed significantly better at query-based tasks than at action-based tasks, with Claude 3.5 Sonnet having the highest overall success rate of 69.67%. 
The authors concluded that there is a \u201cneed for improved LLM capabilities in navigating complex, action-based decision-making tasks.\u201d <\/p>\n<p>A <a href=\"https:\/\/arxiv.org\/abs\/2412.10849\" target=\"_blank\" rel=\"nofollow noopener noreferrer\" data-ga-track=\"ExternalLink:https:\/\/arxiv.org\/abs\/2412.10849\" aria-label=\"third paper\">third paper<\/a> by the group produced even more astounding results; it describes a study that compared LLMs and board-certified physicians across three clinical scenarios: emergency room triage, initial evaluation by a physician, and admission to the hospital or intensive care unit. The researchers found that in all scenarios, the LLM emerged clearly superior and capable of displaying superhuman diagnostic and reasoning abilities.<\/p>\n<p>Certainly, ARiSE is not the only player in the field working on benchmark testing and evaluation of AI models. OpenAI itself announced its <a href=\"https:\/\/openai.com\/index\/healthbench\/\" target=\"_blank\" rel=\"nofollow noopener noreferrer\" data-ga-track=\"ExternalLink:https:\/\/openai.com\/index\/healthbench\/\" aria-label=\"HealthBench\">HealthBench<\/a> rubric earlier this year to better test its own tools and model efficacy in complex clinical scenarios. 
HealthBench was built by 262 physicians across 60 countries and includes 5,000 health conversations, \u201ceach with a custom physician-created rubric to grade model responses.\u201d Another example is Google\u2019s <a href=\"https:\/\/research.google\/blog\/benchmarking-llms-for-global-health\/\" target=\"_blank\" rel=\"nofollow noopener noreferrer\" data-ga-track=\"ExternalLink:https:\/\/research.google\/blog\/benchmarking-llms-for-global-health\/\" aria-label=\"global health\">global health<\/a> LLM benchmarking effort, which entails \u201c11,000+ manually and LLM-generated personas representing a broad array of tropical and infectious diseases across demographic, contextual, location, language, clinical, and consumer augmentations.\u201d Its goal is to understand LLMs\u2019 performance across \u201ctasks involving distribution shifts in disease types, region-specific medical knowledge, and contextual variations across symptoms, language, location, linguistic diversity, and localized cultural contexts.\u201d Despite how quickly the field of model evaluation is growing, the most crucial aspect of ARiSE is that it has no intention of being commercial in nature; rather, it intends to stay independent and objective.<\/p>\n<p>Dr. Adam Rodman, a physician and one of the key research leaders at ARiSE, reiterates the group\u2019s commitment: to push model development in a way that is meaningful for the field by designing impactful benchmarks and tests to deeply ascertain what these systems will do in the real world.<\/p>\n<p>Why is all of this so important?<\/p>\n<p>The world is no longer in a place where LLMs and AI applications are being used simply for written exams or theoretical referencing. 
Rather, the field has reached something of a critical mass, where models are now being actively pursued and employed in actual clinical scenarios, which are often fraught with critical patient care decisions.<\/p>\n<p>An article via <a href=\"https:\/\/huggingface.co\/blog\/leaderboard-medicalllm\" target=\"_blank\" rel=\"nofollow noopener noreferrer\" data-ga-track=\"ExternalLink:https:\/\/huggingface.co\/blog\/leaderboard-medicalllm\" aria-label=\"HuggingFace\">HuggingFace<\/a> describes the value of evaluation exercises as well as the purpose behind its own model leaderboard initiative: \u201cWhen models are used for recreational conversational aspects, errors have little repercussions; this is not the case for uses in the medical domain however, where wrong explanations and answers can have severe consequences for patient care and outcomes.\u201d<\/p>\n<p>Take, for example, the plethora of ambient scribing companies that have rapidly become a mainstay in many practices; an <a href=\"https:\/\/www.aafp.org\/pubs\/afp\/issues\/2025\/0400\/graham-center-artificial-intelligence-scribes.html\" target=\"_blank\" rel=\"nofollow noopener noreferrer\" data-ga-track=\"ExternalLink:https:\/\/www.aafp.org\/pubs\/afp\/issues\/2025\/0400\/graham-center-artificial-intelligence-scribes.html\" aria-label=\"article\">article<\/a> published in the American Academy of Family Physicians\u2019 (AAFP) journal reported that \u201cinitial results from AI integrated organizations show that AI scribe programs significantly decrease electronic health record workload, and as physician burnout rates remain high, there is greater interest in AI tools.\u201d <a class=\"color-link\" href=\"https:\/\/www.mordorintelligence.com\/industry-reports\/ai-powered-clinical-decision-support-market\" target=\"_blank\" rel=\"nofollow noopener noreferrer\" data-ga-track=\"ExternalLink:https:\/\/www.mordorintelligence.com\/industry-reports\/ai-powered-clinical-decision-support-market\" 
aria-label=\"Studies\">Studies<\/a> similarly indicate that AI-powered clinical decision support (CDS) tools are increasingly being used; in fact, the market is expected to reach $1.79 billion by 2030, growing at a compound annual growth rate (CAGR) of 15.6%.<\/p>\n<p>Indeed, demand for AI in actual clinical settings is skyrocketing, especially given pervasive physician burnout, rising costs amid a labor shortage, and a growing population with chronic health conditions. Healthcare is at a breaking point, and the industry is turning to technology as a saving grace. Therefore, objective analytics, transparent benchmarks and impartial studies can no longer be treated as optional, but must instead be viewed as a cornerstone of innovation.<\/p>\n","protected":false},"excerpt":{"rendered":"Objective and transparent evaluation of AI models in healthcare has never been more important. getty The application of&hellip;\n","protected":false},"author":2,"featured_media":197373,"comment_status":"","ping_status":"","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[34],"tags":[971,4587,49,48,46,192,84,392,12169,124,281,278,5028,3277],"class_list":{"0":"post-197372","1":"post","2":"type-post","3":"status-publish","4":"format-standard","5":"has-post-thumbnail","7":"category-healthcare","8":"tag-amazon","9":"tag-america","10":"tag-ca","11":"tag-canada","12":"tag-economy","13":"tag-google","14":"tag-health","15":"tag-healthcare","16":"tag-hiring","17":"tag-jobs","18":"tag-meta","19":"tag-openai","20":"tag-tesla","21":"tag-trump"},"_links":{"self":[{"href":"https:\/\/www.newsbeep.com\/ca\/wp-json\/wp\/v2\/posts\/197372","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.newsbeep.com\/ca\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.newsbeep.com\/ca\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.newsbeep.com\/ca\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/www.newsbeep.com\/ca\/wp-json\/wp\/v2\/comments?post=197372"}],"version-history":[{"count":0,"href":"https:\/\/www.newsbeep.com\/ca\/wp-json\/wp\/v2\/posts\/197372\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.newsbeep.com\/ca\/wp-json\/wp\/v2\/media\/197373"}],"wp:attachment":[{"href":"https:\/\/www.newsbeep.com\/ca\/wp-json\/wp\/v2\/media?parent=197372"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.newsbeep.com\/ca\/wp-json\/wp\/v2\/categories?post=197372"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.newsbeep.com\/ca\/wp-json\/wp\/v2\/tags?post=197372"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}