AI is actually bad at math, ORCA shows • The Register

<p>In the world of George Orwell's 1984, two and two make five. And large language models are not much better at math.</p>

<p>Though AI models have been trained to emit the correct answer, and to recognize that "2 + 2 = 5" might be a reference to the errant equation's use as a Party loyalty test in Orwell's dystopian novel, they still can't calculate reliably.</p>

<p>Scientists affiliated with Omni Calculator, a Poland-based maker of online calculators, and with universities in France, Germany, and Poland, devised a math benchmark called ORCA (Omni Research on Calculation in AI), which poses math-oriented natural-language questions in a wide variety of technical and scientific fields. Then they put five leading LLMs to the test.</p>

<p>ChatGPT-5, Gemini 2.5 Flash, Claude Sonnet 4.5, Grok 4, and DeepSeek V3.2 all scored a failing grade of 63 percent or less.</p>

<p>There are various other benchmarks used to assess the math capabilities of AI models, such as <a href="https://llm-stats.com/benchmarks/gsm8k" rel="nofollow noopener" target="_blank">GSM8K</a> and <a href="https://artificialanalysis.ai/evaluations/math-500" rel="nofollow noopener" target="_blank">MATH-500</a>.
If you were to judge by AI models' scores on many of these tests, you might assume machine learning models had learned nearly everything, with some scoring 0.95 or above.</p>

<p>But benchmarks, <a href="https://www.theregister.com/2025/11/07/measuring_ai_models_hampered_by/" rel="nofollow noopener" target="_blank">as we've noted</a>, are often designed without much scientific rigor.</p>

<p>The researchers behind the ORCA benchmark – Claudia Herambourg, Dawid Siuda, Julia Kopczyńska, Joao R. L. Santos, Wojciech Sas, and Joanna Śmietańska-Nowak – argue that while models like OpenAI's GPT-4 have scored well on tests like GSM8K and MATH, prior research shows LLMs still make errors of logic and arithmetic. According to Oxford University's <a href="https://ourworldindata.org/grapher/test-scores-ai-capabilities-relative-human-performance" rel="nofollow noopener" target="_blank">Our World in Data</a> site, which measures AI models' performance relative to a human baseline score of 0, AI models score -7.44 on math reasoning (based on April 2024 data).</p>

<p>What's more, the authors say, many of the existing benchmark datasets have been incorporated into model training data – a situation similar to students being given the answers prior to an exam.
Thus, they contend, ORCA is needed to evaluate actual computational reasoning, as opposed to pattern memorization.</p>

<p>According to their study, distributed via preprint service <a href="https://arxiv.org/abs/2511.02589" rel="nofollow noopener" target="_blank">arXiv</a> and on Omni Calculator's <a href="https://www.omnicalculator.com/reports/omni-research-on-calculation-in-ai-benchmark" rel="nofollow noopener" target="_blank">website</a>, ChatGPT-5, Gemini 2.5 Flash, Claude Sonnet 4.5, Grok 4, and DeepSeek V3.2 "achieved only 45–63 percent accuracy, with errors mainly related to rounding (35 percent) and calculation mistakes (33 percent)."</p>

<p>The evaluation was conducted in October 2025, using 500 math-oriented prompts across several categories: Biology &amp; Chemistry, Engineering &amp; Construction, Finance &amp; Economics, Health &amp; Sports, Math &amp; Conversions, Physics, and Statistics &amp; Probability.</p>

<p>"Gemini 2.5 Flash achieved the highest overall accuracy (63 percent), followed closely by Grok 4 (62.8 percent), with DeepSeek V3.2 ranking third at 52.0 percent," the paper says.</p>

<p>"ChatGPT-5 and Claude Sonnet 4.5 performed comparably but at lower levels (49.4 percent and 45.2 percent, respectively), indicating that even the most advanced proprietary models still fail on roughly half of all deterministic reasoning tasks. These results confirm that progress in natural-language reasoning does not directly translate into consistent computational reliability."</p>

<p>Claude Sonnet 4.5 had the lowest scores overall – it failed to score better than 65 percent in any of the question categories.
And DeepSeek V3.2 was the most uneven, with strong Math &amp; Conversions performance (74.1 percent) but dismal Biology &amp; Chemistry (10.5 percent) and Physics (31.3 percent) scores.</p>

<p>And yet, these scores may represent nothing more than a snapshot in time, as these models often get adjusted or revised. Consider this question from the Engineering &amp; Construction category, as cited in the paper:</p>

<p>Prompt: Consider that you have 7 blue LEDs (3.6 V) connected in parallel, together with a resistor, subject to a voltage of 12 V and a current of 5 mA. What is the value of the power dissipation in the resistor (in mW)?</p>

<p>Expected result: 42 mW</p>

<p>Claude Sonnet 4.5: 294 mW</p>

<p>When El Reg put this prompt to Claude Sonnet 4.5, the model said it was uncertain whether the 5 mA figure referred to the current per LED (incorrect) or the total current (correct). It offered both the incorrect 294 mW answer and, as an alternative, the correct 42 mW answer.</p>

<p>In short, AI benchmarks don't necessarily add up. But if you want them to, you may find the result is five. ®</p>
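For the curious, the expected 42 mW answer follows from basic series-circuit arithmetic: the parallel LED bank drops 3.6 V, so the resistor drops the remaining 8.4 V, and power is that drop times the current through the resistor. A minimal sketch of both readings of the question – it assumes, as the paper's expected answer does, that 5 mA is the total current:

```python
# ORCA sample question: 7 blue LEDs (3.6 V) in parallel with a series
# resistor on a 12 V supply, total current 5 mA. The resistor drops the
# difference between supply and LED forward voltage; P = V_drop * I.

V_SUPPLY = 12.0   # supply voltage, volts
V_LED = 3.6       # forward voltage of the parallel LED bank, volts
I_TOTAL = 5e-3    # total circuit current, amps (the correct reading)

v_drop = V_SUPPLY - V_LED                        # 8.4 V across the resistor
p_correct_mw = round(v_drop * I_TOTAL * 1e3, 1)
print(p_correct_mw)                              # 42.0 -> the expected answer

# The incorrect reading: 5 mA *per LED*, so 7 x 5 mA = 35 mA through
# the resistor, which yields the 294 mW figure Claude produced.
i_wrong = 7 * 5e-3
p_wrong_mw = round(v_drop * i_wrong * 1e3, 1)
print(p_wrong_mw)                                # 294.0 -> Claude's answer
```

Both numbers in the article fall out of the same formula; only the interpretation of the 5 mA figure differs, which is why the model's hedged second attempt landed on the right answer.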