{"id":606938,"date":"2026-04-14T20:22:12","date_gmt":"2026-04-14T20:22:12","guid":{"rendered":"https:\/\/www.newsbeep.com\/au\/606938\/"},"modified":"2026-04-14T20:22:12","modified_gmt":"2026-04-14T20:22:12","slug":"ai-fails-at-primary-patient-diagnosis-more-than-80-of-the-time-study-finds","status":"publish","type":"post","link":"https:\/\/www.newsbeep.com\/au\/606938\/","title":{"rendered":"AI fails at primary patient diagnosis more than 80% of the time, study finds"},"content":{"rendered":"<p>Generative artificial intelligence (AI) still lacks the reasoning processes needed for safe clinical use, a new study has found. <\/p>\n<p>AI chatbots have improved their diagnostic accuracy when presented with comprehensive clinical information, but still failed to produce an appropriate differential diagnosis more than 80% of the time, according to researchers at Mass General Brigham, a Boston-based non-profit hospital and research network and one of the largest health systems in the United States.<\/p>\n<p>The study, published in the open-access <a href=\"https:\/\/jamanetwork.com\/journals\/jamanetworkopen\/fullarticle\/2847679\" target=\"_blank\" rel=\"noreferrer nofollow noopener\">JAMA Network Open<\/a> medical journal, found that large language models (LLMs) fall short of the reasoning required for clinical use.<\/p>\n<p>\u201cDespite continued improvements, off-the-shelf large language models are not 
ready for unsupervised clinical-grade deployment,\u201d said Marc Succi, co-author of the study. <\/p>\n<p>He added that AI cannot yet replicate differential diagnosis, which is central to clinical reasoning, and which he considers the \u201cart of medicine\u201d. <\/p>\n<p>Differential diagnosis is the first step for healthcare professionals to identify a condition, separating it from others with similar symptoms. <\/p>\n<p>How the models were tested<\/p>\n<p>The research team analysed the functioning of 21 LLMs, including the latest available versions of Claude, DeepSeek, Gemini, GPT and Grok.<\/p>\n<p>They evaluated the LLMs on 29 standardised clinical vignettes using a newly developed tool called PrIME-LLM. <\/p>\n<p>The tool assesses a model\u2019s ability across different stages of clinical reasoning: conducting an initial diagnosis, ordering appropriate tests, arriving at a final diagnosis, and planning treatment. <\/p>\n<p>To simulate how clinical cases unfold, the researchers gradually fed the models information, beginning with basics such as a patient\u2019s age, sex and symptoms, before adding physical examination findings and laboratory results. <\/p>\n<p>A differential diagnosis is critical in a real-world clinical setting to advance to the next step. However, in the study, the models were given additional information so that they could proceed to the next stage even if they failed at the differential diagnosis step.<\/p>\n<p>The researchers found that the language models achieved high accuracy on final diagnoses but performed poorly in generating differential diagnoses and navigating uncertainty.<\/p>\n<p>Study author Arya Rao noted that by evaluating LLMs in a stepwise fashion, research moves past treating them like test-takers and puts them in a doctor\u2019s position. 
<\/p>\n<p>\u201cThese models are great at naming a final diagnosis once the data is complete, but they struggle at the open-ended start of a case, when there isn&#8217;t much information,\u201d she added. <\/p>\n<p>The researchers found that all of the models failed to produce an appropriate differential diagnosis more than 80% of the time. <\/p>\n<p>On final diagnosis, success rates ranged from around 60% to over 90% depending on the model.<\/p>\n<p>Most of the LLMs showed improved accuracy when provided with laboratory results and imaging in addition to text.<\/p>\n<p>The results identified a top-performing cluster that included Grok 4, GPT-5, GPT-4.5, Claude 4.5 Opus, Gemini 3.0 Flash and Gemini 3.0 Pro. <\/p>\n<p>Medical professionals are still key<\/p>\n<p>However, the authors noted that despite version-based improvements and advantages in reasoning-optimised models, off-the-shelf LLMs have not yet achieved the level of intelligence required for safe deployment and remain limited in demonstrating advanced clinical reasoning. <\/p>\n<p>\u201cOur results reinforce that large language models in healthcare continue to require a \u2018human in the loop\u2019 and very close oversight,\u201d Succi noted. <\/p>\n<p>Susana Manso Garc\u00eda, a member of the Artificial Intelligence and Digital Health working group of the Spanish Society of Family and Community Medicine, who was not involved in the study, said the findings carry a clear message for the public. <\/p>\n<p>\u201cThe study itself insists they [language models] should not be used to make clinical decisions without supervision. Therefore, whilst artificial intelligence represents a promising tool, human clinical judgement remains indispensable,\u201d she said. 
<\/p>\n<p>\u201cThe recommendation for the public is to use these technologies with caution and, when faced with any health concern, always consult a healthcare professional.&#8221;<\/p>\n","protected":false},"excerpt":{"rendered":"Generative artificial intelligence (AI) still lacks the reasoning processes needed for safe clinical use, a new study has&hellip;\n","protected":false},"author":2,"featured_media":606939,"comment_status":"","ping_status":"","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[20],"tags":[256,254,255,64,63,19040,137,1679,105],"class_list":{"0":"post-606938","1":"post","2":"type-post","3":"status-publish","4":"format-standard","5":"has-post-thumbnail","7":"category-artificial-intelligence","8":"tag-ai","9":"tag-artificial-intelligence","10":"tag-artificialintelligence","11":"tag-au","12":"tag-australia","13":"tag-chatbot","14":"tag-health","15":"tag-medicine","16":"tag-technology"},"_links":{"self":[{"href":"https:\/\/www.newsbeep.com\/au\/wp-json\/wp\/v2\/posts\/606938","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.newsbeep.com\/au\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.newsbeep.com\/au\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.newsbeep.com\/au\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/www.newsbeep.com\/au\/wp-json\/wp\/v2\/comments?post=606938"}],"version-history":[{"count":0,"href":"https:\/\/www.newsbeep.com\/au\/wp-json\/wp\/v2\/posts\/606938\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.newsbeep.com\/au\/wp-json\/wp\/v2\/media\/606939"}],"wp:attachment":[{"href":"https:\/\/www.newsbeep.com\/au\/wp-json\/wp\/v2\/media?parent=606938"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.newsbeep.com\/au\/wp-json\/wp\/v2\/categories?post=606938"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.newsbeep.com\/au\/wp-json
\/wp\/v2\/tags?post=606938"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}