In December 2025, a group of researchers from around the world gathered at the Simons Institute for the Theory of Computing at UC Berkeley. Among them was Nikhil Srivastava, an associate professor of mathematics at UC Berkeley and a senior scientist at the Simons Institute, who joined the group on a mission to create a new way of assessing the mathematical capabilities of AI.
Led by Stanford University math professor Mohammed Abouzaid and University of Texas at Austin math professor Rachel Ward, the team of 11 embarked on a project called First Proof. A preprint published by the team describes First Proof as a way to assess AI models’ abilities to solve research-level math questions.
“It was a way for … mathematicians (to get) the narrative back on AI solving math problems or proving mathematical theorems,” said Fields medalist Martin Hairer, a math professor at Imperial College London and EPFL, the Swiss Federal Institute of Technology in Lausanne. The Fields Medal is often considered one of the “Nobels of math.” “It was about trying to test how these things perform on things that mathematicians actually care about.”
Unlike benchmarks that evaluate AI models on their success at solving Olympiad math problems, First Proof presents 10 questions that “arose naturally in the research process of the authors.”
“AIs have basically solved these Olympiad problems,” said UCLA math professor and Fields medalist Terence Tao, who is not affiliated with First Proof. “They don’t have 100% success right now, but they have gold medal-level performance. But it is not really an apples-to-apples comparison, because the memory that an AI model can store is millions of times larger than a high school student’s. … So the next level after that is research-level problems.”
The problems, each written and proven by the researchers, are not substantial enough to constitute a paper on their own. However, the solutions are part of larger pieces of unpublished research, according to preprint co-author Daniel Spielman, a MacArthur Foundation Genius Grant recipient who is Sterling Professor of computer science and a professor of statistics and data science and of mathematics at Yale University.
The solutions to these research questions are encrypted until 11:59 p.m. Pacific Time on Friday, according to First Proof’s website. Because the solutions are not publicly available, large language models cannot simply search the web for answers.
Before the solutions are posted, the team encourages the broader math community to get involved by trying to solve the problems with AI themselves, inviting participants to post their findings on social media using the hashtag #1stProof.
“I’ve looked around on social media and looked at what people are doing, and it’s amazing,” Srivastava said. “It’s just really nice to see the interest from people who are not experts in the area — they’ve come up with very interesting ways of combining different AI systems to work on the problems. … And the best part of it is that a lot of people say they’re learning math in this process.”
According to Hairer, the team even received solutions generated by internal models at OpenAI, “clearly of better quality” than ChatGPT is currently capable of producing.
The preprint states that the AI models used, GPT-5.2 Pro and Gemini 3 Deep Think, “struggle to answer” many of the questions posed in a single attempt. The researchers did not interact with the AI systems at all during the process, but speculated that doing so would “coax the systems to produce better answers.”
Spielman and Hairer both said they were not surprised that the models were unsuccessful. However, many of the researchers said they have tried using AI in their research already and expressed optimism about the future of AI.
“I’m really looking forward to having some AI assistants help me in my research,” Spielman said. “I have this dream of telling it, ‘check this idea, check that idea, see if this part works.’”
According to First Proof’s website, the team will not be assessing the correctness of any solutions posted by the community, since the researchers do not consider the current problem set to be a solid benchmark.
However, according to Genius Grant recipient and preprint co-author Lauren Williams, Dwight Parker Robinson Professor of mathematics at Harvard University, the team will create a grading scheme to determine the accuracy of solutions for future problem sets.
The team plans to create a second set of questions in the coming months, according to the preprint, and is open to forming agreements with companies that would like to test experimental AI models on the forthcoming problems.
The authors published the preprint Feb. 6, a day before Euler Day on Feb. 7. That date is a nod to Euler’s number e, approximately 2.718, whose first two digits match 2/7.
First Proof’s website explains the project’s name as a reference to a “first proof” in baking, a bulk fermentation process “in which one lets the entire batch of dough ferment as one mass, before dividing and shaping it into loaves,” alluding to the team’s intention to allow ideas to ferment in the community in order to produce a more structured AI benchmark.
Keeping with the baking theme, Williams said the team’s tentative plan is to release details about the next batch of problems on Pi Day, March 14.
“This is, I think, a cultural shift,” Tao said. “I think we will start seeing … people start assembling lists of problems and just seeing the solutions trickle in from AI. It’s a new way of doing mathematics.”