{"id":511159,"date":"2026-04-03T17:59:43","date_gmt":"2026-04-03T17:59:43","guid":{"rendered":"https:\/\/www.newsbeep.com\/uk\/511159\/"},"modified":"2026-04-03T17:59:43","modified_gmt":"2026-04-03T17:59:43","slug":"there-are-more-ai-health-tools-than-ever-but-how-well-do-they-work","status":"publish","type":"post","link":"https:\/\/www.newsbeep.com\/uk\/511159\/","title":{"rendered":"There are more AI health tools than ever\u2014but how well do they work?"},"content":{"rendered":"<p>Ideally, Bean says, health chatbots would be subjected to controlled tests with human users, as they were in his study, before being released to the public. That might be a heavy lift, particularly given how fast the AI world moves and how long human studies can take. Bean\u2019s own study used GPT-4o, which came out almost a year ago and is now outdated.\u00a0<\/p>\n<p>Earlier this month, Google released a study that meets Bean\u2019s standards. In the study, patients discussed medical concerns with the company\u2019s Articulate Medical Intelligence Explorer (AMIE), a medical LLM chatbot that is not yet available to the public, before meeting with a human physician. Overall, AMIE\u2019s diagnoses were just as accurate as physicians\u2019, and none of the conversations raised major safety concerns for researchers.\u00a0<\/p>\n<p>Despite the encouraging results, Google isn\u2019t planning to release AMIE anytime soon. \u201cWhile the research has advanced, there are significant limitations that must be addressed before real-world translation of systems for diagnosis and treatment, including further research into equity, fairness, and safety testing,\u201d wrote Alan Karthikesalingam, a research scientist at Google DeepMind, in an email. 
Google did recently reveal that Health100, a health platform it is building in partnership with CVS, will include an AI assistant powered by its flagship Gemini models, though that tool will presumably not be intended for diagnosis or treatment.<\/p>\n<p>Rodman, who led the AMIE study with Karthikesalingam, doesn\u2019t think such extensive, multiyear studies are necessarily the right approach for chatbots like ChatGPT Health and Copilot Health. \u201cThere\u2019s lots of reasons that the clinical trial paradigm doesn\u2019t always work in generative AI,\u201d he says. \u201cAnd that\u2019s where this benchmarking conversation comes in. Are there benchmarks [from] a trusted third party that we can agree are meaningful, that the labs can hold themselves to?\u201d<\/p>\n<p>The key there is \u201cthird party.\u201d No matter how extensively companies evaluate their own products, it\u2019s tough to trust their conclusions completely. Not only does a third-party evaluation bring impartiality, but if there are many third parties involved, it also helps protect against blind spots.<\/p>\n<p>OpenAI\u2019s Singhal says he\u2019s strongly in favor of external evaluation. \u201cWe try our best to support the community,\u201d he says. \u201cPart of why we put out HealthBench was actually to give the community and other model developers an example of what a very good evaluation looks like.\u201d\u00a0<\/p>\n<p>Given how expensive it is to produce a high-quality evaluation, he says, he\u2019s skeptical that any individual academic laboratory would be able to produce what he calls \u201cthe one evaluation to rule them all.\u201d But he does speak highly of efforts that academic groups have made to bring preexisting and novel evaluations together into comprehensive evaluation suites\u2014such as Stanford\u2019s MedHELM framework, which tests models on a wide variety of medical tasks. 
Currently, OpenAI\u2019s GPT-5 holds the highest MedHELM score.<\/p>\n<p>Nigam Shah, a professor of medicine at Stanford University who led the MedHELM project, says it has limitations. In particular, it only evaluates individual chatbot responses, but someone who\u2019s seeking medical advice from a chatbot tool might engage it in a multi-turn, back-and-forth conversation. He says that he and some collaborators are gearing up to build an evaluation that can score those complex conversations, but that it will take time and money. \u201cYou and I have zero ability to stop these companies from releasing [health-oriented products], so they\u2019re going to do whatever they damn please,\u201d he says. \u201cThe only thing people like us can do is find a way to fund the benchmark.\u201d<\/p>\n<p>No one interviewed for this article argued that health LLMs need to perform perfectly on third-party evaluations in order to be released. Doctors themselves make mistakes\u2014and for someone who has only occasional access to a doctor, a consistently accessible LLM that sometimes messes up could still be a huge improvement over the status quo, as long as its errors aren\u2019t too grave.\u00a0<\/p>\n<p>With the current state of the evidence, however, it\u2019s impossible to know for sure whether the currently available tools do in fact constitute an improvement, or whether their risks outweigh their benefits. 
<\/p>\n","protected":false},"excerpt":{"rendered":"Ideally, Bean says, health chatbots would be subjected to controlled tests with human users, as they were in&hellip;\n","protected":false},"author":2,"featured_media":511160,"comment_status":"","ping_status":"","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[43],"tags":[102,2960,56,54,55],"class_list":{"0":"post-511159","1":"post","2":"type-post","3":"status-publish","4":"format-standard","5":"has-post-thumbnail","7":"category-healthcare","8":"tag-health","9":"tag-healthcare","10":"tag-uk","11":"tag-united-kingdom","12":"tag-unitedkingdom"},"_links":{"self":[{"href":"https:\/\/www.newsbeep.com\/uk\/wp-json\/wp\/v2\/posts\/511159","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.newsbeep.com\/uk\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.newsbeep.com\/uk\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.newsbeep.com\/uk\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/www.newsbeep.com\/uk\/wp-json\/wp\/v2\/comments?post=511159"}],"version-history":[{"count":0,"href":"https:\/\/www.newsbeep.com\/uk\/wp-json\/wp\/v2\/posts\/511159\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.newsbeep.com\/uk\/wp-json\/wp\/v2\/media\/511160"}],"wp:attachment":[{"href":"https:\/\/www.newsbeep.com\/uk\/wp-json\/wp\/v2\/media?parent=511159"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.newsbeep.com\/uk\/wp-json\/wp\/v2\/categories?post=511159"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.newsbeep.com\/uk\/wp-json\/wp\/v2\/tags?post=511159"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}