Exclusive  Current-day LLMs are prediction engines and, as such, they can only find the most likely solution to a problem, which is not necessarily the correct one. Though popular models have mostly become better at math, even top performer Gemini 3 Flash would receive a C if assessed on a letter-grade scale.
Researchers affiliated with Omni Calculator, a maker of online calculators for specific applications, have subjected a new set of AI models to the company’s ORCA Benchmark, which consists of 500 practical math questions.
In their initial evaluation last November, OpenAI’s ChatGPT-5, Google’s Gemini 2.5 Flash, Anthropic’s Claude Sonnet 4.5, xAI’s Grok 4, and DeepSeek’s DeepSeek V3.2 (alpha) all did poorly, scoring 63 percent or less on math problems.
The latest set of contestants consists of ChatGPT-5.2, Gemini 3 Flash, Grok 4.1, and DeepSeek V3.2 (stable release). Sonnet 4.5 didn’t get re-evaluated as it hadn’t changed and its successor had not been released during the testing period.
For this second round of testing – provided to The Register prior to publication – all the models showed improvement except for Grok 4.1, which regressed.
Gemini 3 Flash saw its accuracy hit 72.8 percent, a gain of 9.8 percentage points from its predecessor. DeepSeek V3.2 reached 55.2 percent, a gain of 3.2 percentage points from its alpha version. ChatGPT 5.2 achieved 54.0 percent accuracy, up 4.6 percentage points. And Grok 4.1 slipped to 60.2 percent, a loss of 2.6 percentage points.
[Chart: ORCA test results for AI models]
“A calculator is predictable,” said Dawid Siuda, researcher at ORCA, in a statement. “Ask it the same question today or next year, and the answer stays the same. AI doesn’t work that way. These systems are predicting the next likely word based on patterns. Mathematically, it’s possible for a model to get a question right today and wrong tomorrow.”
The researchers attempted to assess the variability of model responses with a metric dubbed “instability” – a measure of how often models changed their answers when asked the same question twice.
Gemini 3 Flash proved the most consistent, changing its answer on just 46.1 percent of incorrect responses. ChatGPT, the researchers report, changed its answer on 65.2 percent of errors. And DeepSeek V3.2 changed its answer for 68.8 percent of errors.
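The metric is straightforward in principle: of the questions a model got wrong the first time, what fraction of its answers differ on a repeat ask? A minimal sketch of such a calculation follows; the function name and data are illustrative, not ORCA's actual methodology.

```python
# Hypothetical sketch of an "instability" metric like the one ORCA describes:
# the share of initially incorrect answers that change on a repeated ask.
# Names and example data are illustrative, not ORCA's actual methodology.

def instability(first_answers, second_answers, correct_answers):
    """Fraction of initially wrong answers that differ on the second ask."""
    wrong = [i for i, a in enumerate(first_answers) if a != correct_answers[i]]
    if not wrong:
        return 0.0
    changed = sum(1 for i in wrong if first_answers[i] != second_answers[i])
    return changed / len(wrong)

# Example: four questions, two wrong initially, one of those changed on re-ask.
first  = ["4", "7", "10", "12"]
second = ["4", "8", "10", "12"]
truth  = ["4", "8", "10", "13"]
print(instability(first, second, truth))  # 0.5
```

Note this measures only whether answers move, not whether they move toward the truth; a model that confidently repeats the same wrong answer would score as perfectly stable.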
The ORCA researchers note that model performance improvements over time differ across domains. DeepSeek, they say, saw its performance on Biology & Chemistry questions go from 10.5 percent accuracy to 43.9 percent. And Gemini 3 Flash reached Math & Conversions accuracy of 93.2 percent, up from 83 percent. Grok 4.1 meanwhile lost 9 percentage points for its accuracy answering Health & Sports problems and lost 5.3 percentage points for Biology & Chemistry.
The researchers speculate that recent updates to Grok may have prioritized other capabilities than quantitative reasoning.
Noting that calculation errors now account for 39.8 percent of all mistakes, up from 33.4 percent, and that rounding errors slipped to 25.8 percent, down from 34.7 percent, the ORCA group conclude that AI models are getting better at making the math look right through formatting, while still struggling with arithmetic.
“AI models are essentially prediction engines rather than logic engines,” Siuda told The Register in an email. “Because they work on probability, they are basically guessing the next most likely number or word based on patterns they have seen before. It is like a student who memorizes every answer in a math book but never actually learns how to add.”
Siuda said this limitation was known about models previously, and it hasn't changed.
“They might get the right answer most of the time, but the second you give them a unique or tricky problem, or multi-step task, they stumble because they are not truly calculating anything,” he said. “It’s probably impossible to close this gap completely with the current technology, but if we merge LLMs with function calling well enough, it may be possible to solve.”
Function calling – farming out arithmetic to a deterministic source – is one way around the poor math handling of models.
“Major AI companies like Google and OpenAI are already doing this by having the AI call a function to do the actual calculation,” explained Siuda. “The real headache happens with long, messy problems. The AI has to keep track of every little result at each stage, and it usually gets overwhelmed or confused.”
Another possible avenue for improvement might be teaching models to verify responses through formal proofs. As noted in Nature last November, Google’s DeepMind has developed an approach that scored a silver medal result on the International Mathematical Olympiad through reinforcement learning based on proofs developed with the Lean programming language and proof assistant.
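The appeal of this route is that a proof assistant's kernel, not a probability distribution, certifies each claim. A trivial Lean 4 illustration of the idea (not DeepMind's actual system):

```lean
-- Two machine-checked facts in Lean 4: the kernel verifies each claim by
-- computation, so a wrong claim simply fails to compile rather than being
-- confidently asserted.
example : 2 + 2 = 4 := rfl
example : 1234 * 5678 = 7006652 := rfl
```

If either equation were false, the `rfl` proof would be rejected at check time, which is the property that makes proof-based verification attractive as a backstop for model reasoning.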
But for the time being, trust no AI. ®
