At a secret meeting in 2025, some of the world’s leading mathematicians gathered to test OpenAI’s newest large language model, o4-mini.

Experts at the meeting were amazed by how much the model’s responses sounded like a real mathematician when delivering a complex proof.

You may like

Ono acknowledged that the model might be giving convincing — but potentially incorrect — answers.

“Unfortunately, the AI is much better at sounding like they have the right answer than actually getting it … right or wrong; they will always look convincing,”

Terry Tao, UCLA mathematician

“If you were a terrible mathematician, you would also be a terrible mathematical writer, and you would emphasize the wrong things,” Terry Tao, a mathematician at UCLA and the 2006 winner of the prestigious Fields Medal, told Live Science. “But AI has broken that signal.”

Naturally, mathematicians are beginning to worry that AI will spam them with convincing-looking proofs that actually contain flaws that are difficult for humans to detect.

Tao warned that AI-generated arguments might be incorrectly accepted because they look rigorous.

You may like

“Unfortunately, the AI is much better at sounding like they have the right answer than actually getting it … right or wrong; they will always look convincing,” Tao said.

He urged caution on the acceptance of AI ‘”proofs.” “One thing we’ve learned from using AIs is that if you give them a goal, they will cheat like crazy to achieve the goal,” Tao said.

While it may seem largely abstract to ask whether we can truly “prove” highly technical mathematical conjectures if we can’t understand the proofs, the answers can have significant implications. After all, if we can’t trust a proof, we can’t develop further mathematical tools or techniques from that foundation.

For instance, one of the major outstanding problems in computational math, dubbed P vs. NP, asks, in essence, whether problems whose solutions are easy to check are also easy to find in the first place. If we can prove that, we could transform scheduling and routing, streamline supply chains, accelerate chip design, and even speed up drug discovery. The flip side is that a verifiable proof might also compromise the security of most current cryptographic systems. Far from being arcane, there is real jeopardy in the answers to these questions.

Andrew Granville, a mathematician at the University of Montreal, suspects there are issues even with some of the better-known and more scrutinized human-made mathematical proofs.

There’s some evidence for that claim. “There have been some famous papers that are wrong because of little linguistic issues,” Granville told Live Science.

Perhaps the best-known example is Andrew Wiles‘ proof of Fermat’s last theorem. The theorem states that although there are whole numbers where one square plus another square equals a third square (like 32+42=52), there are no whole numbers that make the same true for cubes, fourth powers, or any other higher powers.


Fermat proposed what’s now known as his “last” theorem in 1637. The 1670 book “Arithmetica” includes Fermat’s commentary, which was published after his death. (Image credit: Wikimedia Commons)

Wiles famously spent seven years working in almost complete isolation and, in 1993, presented his proof as a lecture series in Cambridge, to great fanfare. When Wiles finished his last lecture with the immortal line “I think I’ll stop there,” the audience broke into thunderous applause and Champagne was uncorked to celebrate the achievement. Newspapers around the world proclaimed the mathematician’s victory over the 350-year-old problem.

A man with curly brown hair and wireframe glasses wearing a black sweater stands in front of a green chalkboard with equations on it written in white scrawl with a seated crowd in front of him

Andrew Wiles describing his proof of the Taniyama-Shimura Conjecture in 1993. His initial proof contained an error, but he ultimately found a final solution which would lead to him proving Fermat’s last theorem. (Image credit: Science Photo Library)

During the peer-review process, however, a reviewer spotted a significant flaw in Wiles’ proof. He spent another year working on the problem and eventually fixed the issue.

But for a short time, the world believed the proof was solved, when, in fact, it hadn’t been.

Kevin Buzzard, a mathematician at Imperial College London, is one of the leading proponents of the formal verification. “I started in this business because I was worried that human proofs were incomplete and incorrect and that we humans were doing a poor job documenting our arguments,” Buzzard told Live Science.

In addition to verifying existing human proofs, AI, working in conjunction with programs like Lean, could be game-changing, mathematicians said.

“If we force AI output to produce things in a formally verified language, then this, in principle, solves most of the problem,” of AI coming up with convincing-looking, but ultimately incorrect proofs, Tao said.

“There are papers in mathematics where nobody understands the whole paper. You know, there’s a paper with 20 authors and each author understands their bit. Nobody understands the whole thing. And that’s fine. That’s just how it works.”

Kevin Buzzard, Imperial College London mathematician

Buzzard agreed. “You would like to think that maybe we can get the system to not just write the model output, but translate it into Lean, run it through Lean,” he said. He imagined a back-and-forth interaction between Lean and the AI in which Lean would point out errors and the AI would attempt to correct them.

If AI models can be made to work with formal verification languages, AI could then tackle some of the most difficult problems in mathematics by finding connections beyond the scope of human creativity, experts told Live Science.

“AI is very good at finding links between areas of mathematics that we wouldn’t necessarily think to connect,” Marc Lackenby, a mathematician at the University of Oxford, told Live Science.

The four color theorem states that any map can be colored in with just four colors, such that none of the same colors touch each other. It was formally proven, largely using a computer, by 2005. (Image credit: Science Photo Library)

Almost 50 years ago, in 1976, mathematicians broke the problem into thousands of small, checkable cases and wrote computer programs to verify each one. As long as the mathematicians were convinced there weren’t any problems with the code they’d written, they were reassured the proof was correct. The first computer-assisted proof of the four-color theorem was published in 1977. Confidence in the proof built gradually over the years and was reinforced to the point of almost universal acceptance when a simpler, but still compute-aided, proof was produced in 1997 and a formally verified machine-checked proof was published in 2005.

“The four-color theorem was proved with a computer,” Buzzard noted. “People were very upset about that. But now it’s just accepted. It’s in textbooks.”