
Artificial intelligence systems often perform impressively on standardized medical exams—but new research suggests these test scores may be misleading. A study published in JAMA Network Open indicates that large language models, or LLMs, might not actually “reason” through clinical questions. Instead, they seem to rely heavily on recognizing familiar answer patterns. When those patterns were slightly altered, the models’ performance dropped significantly—sometimes by more than half.

Large language models are a type of artificial intelligence system trained to process and generate human-like language. They are built using vast datasets that include books, scientific papers, web pages, and other text sources. By analyzing patterns in this data, these models learn how to respond to questions, summarize information, and even simulate reasoning. In recent years, several models have achieved high scores on medical exams, sparking interest in using them to support clinical decision-making.

But high test scores do not necessarily indicate an understanding of the underlying content. Instead, many of these models may simply be predicting the most likely answer based on statistical patterns. This raises the question: are they truly reasoning about medical scenarios, or just mimicking answers they’ve seen before? That’s what the researchers behind the new study set out to examine.

“I am particularly excited about bridging the gap between model building and model deployment and the right evaluation is key to that,” explained study author Suhana Bedi, a PhD student at Stanford University.

“We have AI models achieving near perfect accuracy on benchmarks like multiple choice based medical licensing exam questions. But this doesn’t reflect the reality of clinical practice. We found that less than 5% of papers evaluate LLMs on real patient data which can be messy and fragmented.”

“So, we released a benchmark suite of 35 benchmarks mapped to a taxonomy of real medical and healthcare tasks that were verified by 30 clinicians. We found that most models (including reasoning models) struggled on Administrative and Clinical Decision Support tasks.”

“We hypothesized that this was because these tasks involved complex reasoning scenarios that couldn’t be solved through pattern matching alone, exactly the kind of clinical thinking that matters in real practice,” Bedi explained. “With everyone talking about deploying AI in hospitals, we thought this was a very important question to answer.”

To investigate this, the research team created a modified version of the MedQA benchmark. They selected 100 multiple-choice questions from the original test and rewrote a subset of them to replace the correct answer with “None of the other answers,” or NOTA. This subtle shift forced the models to rely on actual medical reasoning rather than simply recognizing previously seen answer formats. A practicing clinician reviewed all changes to ensure the new “None of the other answers” response was medically appropriate.
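In code, a substitution like this is simple to express. The sketch below is only an illustration of the idea, not the authors’ actual pipeline; the question format, field names, and the example item are assumptions made for the sake of the example.

```python
# Minimal sketch of the "None of the other answers" (NOTA) substitution.
# Illustrative only; not the study's actual pipeline. Each question is
# assumed to be a dict with "stem", "options" (label -> text), and
# "answer" (the label of the correct option).
NOTA_TEXT = "None of the other answers"

def make_nota_variant(question: dict) -> dict:
    """Return a copy of the question whose originally correct option text
    is replaced with 'None of the other answers'."""
    modified = {
        "stem": question["stem"],
        "options": dict(question["options"]),
        "answer": question["answer"],  # same label, now pointing at NOTA
    }
    modified["options"][question["answer"]] = NOTA_TEXT
    return modified

# Hypothetical example item, loosely modeled on the metatarsus adductus case:
original = {
    "stem": "A newborn has an inward-turning forefoot. What is the next step?",
    "options": {"A": "Reassurance", "B": "Casting", "C": "Surgery", "D": "Bracing"},
    "answer": "A",
}
nota_version = make_nota_variant(original)
print(nota_version["options"]["A"])  # -> "None of the other answers"
```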


Sixty-eight of the questions met the criteria for this test set. Each question presented a clinical scenario and asked for the most appropriate next step in treatment or diagnosis. One example involved a newborn with an inward-turning foot—a typical case of metatarsus adductus, which usually resolves on its own. In the original version, “Reassurance” was the correct answer. In the modified version, “Reassurance” was removed and replaced with “None of the other answers,” making the task more challenging.

Bedi and her colleagues then evaluated six widely used artificial intelligence models, including GPT-4o, Claude 3.5 Sonnet, Gemini 2.0 Flash, and others. All models were prompted to reason through each question using a method called chain-of-thought, which encourages step-by-step explanations of their answers. This approach is intended to support more deliberate reasoning rather than simple guesswork.
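In practice, a chain-of-thought evaluation of a multiple-choice item can be as simple as asking the model to reason aloud before committing to a letter. The snippet below is a generic sketch under assumed field names; it is not the study’s exact prompt, and `query_model` is a placeholder for whichever model API is being called.

```python
import re

# Generic chain-of-thought prompt for a multiple-choice clinical item.
# A sketch only: the study's exact prompt wording is not reproduced here,
# and query_model stands in for a real model API call.
def build_cot_prompt(question: dict) -> str:
    options = "\n".join(f"{label}. {text}" for label, text in question["options"].items())
    return (
        "You are answering a medical licensing exam question.\n\n"
        f"{question['stem']}\n\n{options}\n\n"
        "Think through the clinical scenario step by step, then state your "
        "final answer as a single option letter on the last line."
    )

def extract_choice(response: str) -> str | None:
    """Pull the option letter from the last non-empty line of the reply."""
    lines = [line for line in response.splitlines() if line.strip()]
    if not lines:
        return None
    match = re.search(r"\b([A-E])\b", lines[-1])
    return match.group(1) if match else None

def answer_with_cot(question: dict, query_model) -> str | None:
    """Prompt a model with chain-of-thought and parse its chosen option."""
    return extract_choice(query_model(build_cot_prompt(question)))
```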

The models were tested on both the original and modified questions, and the researchers compared their performance across these two conditions. They used statistical methods to measure the significance of any accuracy drops, with a focus on whether each model could maintain performance when familiar patterns were removed.
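Because each model answers the same items in both conditions, the comparison is a paired one. One standard way to test a paired drop in accuracy is McNemar’s test on the discordant answers, sketched below with statsmodels; this illustrates the general approach and is not a claim about the exact statistics the authors used.

```python
# Sketch of a paired comparison of correctness on original vs. NOTA items.
# McNemar's test is one common choice for paired binary outcomes; this is
# illustrative, not necessarily the test used in the study.
from statsmodels.stats.contingency_tables import mcnemar

def compare_conditions(correct_original: list[bool], correct_modified: list[bool]):
    pairs = list(zip(correct_original, correct_modified))
    both = sum(o and m for o, m in pairs)
    only_original = sum(o and not m for o, m in pairs)   # lost after modification
    only_modified = sum(m and not o for o, m in pairs)   # gained after modification
    neither = sum(not o and not m for o, m in pairs)
    table = [[both, only_original], [only_modified, neither]]
    result = mcnemar(table, exact=True)  # exact binomial test on discordant pairs
    acc_original = sum(correct_original) / len(correct_original)
    acc_modified = sum(correct_modified) / len(correct_modified)
    return acc_original, acc_modified, result.pvalue
```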

The results suggest that none of the models passed this test unscathed. All six experienced a noticeable decline in accuracy when presented with the NOTA-modified questions. Some models, like DeepSeek-R1 and o3-mini, were more resilient than others, showing drops of around 9 to 16 percent.

But the more dramatic declines were seen in widely used models such as GPT-4o and Claude 3.5 Sonnet, which showed reductions of over 25 percent and 33 percent, respectively. Llama 3.3-70B had the largest drop in performance, answering nearly 40 percent more questions incorrectly when the correct answer was replaced with “None of the other answers.”

“What surprised us most was the consistency of the performance decline across all models, including the most advanced reasoning models like DeepSeek-R1 and o3-mini,” Bedi told PsyPost.

These findings suggest that current AI models tend to rely on recognizing common patterns in test formats, rather than reasoning through complex medical decisions. When familiar options are removed or altered, performance deteriorates, sometimes dramatically.

The researchers interpret this pattern as evidence that many AI systems may not be equipped to handle novel clinical situations—at least not yet. In real-world medicine, patients often present with overlapping symptoms, incomplete histories, or unexpected complications. If an AI system cannot handle minor shifts in question formatting, it may also struggle with this kind of real-life variability.

“These AI models aren’t as reliable as their test scores suggest,” Bedi said. “When we changed the answer choices slightly, performance dropped dramatically, with some models going from 80% accuracy down to 42%. It’s like having a student who aces practice tests but fails when the questions are worded differently. For now, AI should help doctors, not replace them.”

While the study was relatively small, limited to 68 test questions, the consistency of the performance decline across all six models raised concern. The authors acknowledge that more research is needed, including testing larger and more diverse datasets and evaluating models using different methods, such as retrieval-augmented generation or fine-tuning on clinical data.

“We only tested 68 questions from one medical exam, so this isn’t the full picture of AI capabilities,” Bedi noted. “Also, we used a specific way to test reasoning, there might be other approaches that reveal different strengths or weaknesses. Real clinical deployment would likely involve more sophisticated setups than what we tested.”

Still, the authors suggest their results point to three major priorities moving forward: building evaluation tools that separate true reasoning from pattern recognition, improving transparency around how current systems handle novel medical problems, and developing new models that prioritize reasoning abilities.

“We want to build better tests that can tell the difference between AI systems that reason versus those that just memorize patterns,” Bedi said. “We’re also hoping this work pushes the field toward developing AI that’s more genuinely reliable for medical use, not just good at taking tests.”

“The main thing is that impressive test scores don’t automatically mean an AI system is ready for the real world. Medicine is complicated and unpredictable, and we need AI systems that can handle that complexity safely. This research is about making sure we get there responsibly.”

The study, “Fidelity of Medical Reasoning in Large Language Models,” was authored by Suhana Bedi, Yixing Jiang, Philip Chung, Sanmi Koyejo, and Nigam Shah.