{"id":56156,"date":"2025-08-09T20:41:16","date_gmt":"2025-08-09T20:41:16","guid":{"rendered":"https:\/\/www.newsbeep.com\/au\/56156\/"},"modified":"2025-08-09T20:41:16","modified_gmt":"2025-08-09T20:41:16","slug":"mathematicians-question-ai-performance-at-international-math-olympiad","status":"publish","type":"post","link":"https:\/\/www.newsbeep.com\/au\/56156\/","title":{"rendered":"Mathematicians Question AI Performance at International Math Olympiad"},"content":{"rendered":"<p class=\"\" data-block=\"sciam\/paragraph\">A defining memory from my senior year of high school was a nine-hour math exam with just six questions. Six of the top scorers won slots on the U.S. team for the International Math Olympiad (IMO), the world\u2019s longest running math competition for high school students. I didn\u2019t make the cut, but became a tenured mathematics professor anyway.<\/p>\n<p class=\"\" data-block=\"sciam\/paragraph\">This year\u2019s olympiad, held last month on Australia\u2019s Sunshine Coast, had an unusual sideshow. While 110 students from around the world went to work on complex math problems using pen and paper, several AI companies quietly tested new models in development on a computerized approximation of the exam. Right after the closing ceremonies, <a href=\"https:\/\/x.com\/OpenAI\/status\/1946594928945148246\" rel=\"nofollow\">OpenAI<\/a> and later <a href=\"https:\/\/deepmind.google\/discover\/blog\/advanced-version-of-gemini-with-deep-think-officially-achieves-gold-medal-standard-at-the-international-mathematical-olympiad\/\" rel=\"nofollow noopener\" target=\"_blank\">Google DeepMind<\/a> announced that their models earned (unofficial) gold medals for solving five of the six problems. Researchers like S\u00e9bastien Bubeck of OpenAI celebrated these models\u2019 successes as a \u201c<a href=\"https:\/\/www.uniladtech.com\/news\/ai\/ai-researcher-moon-landing-moment-artificial-intelligence-479691-20250722\" rel=\"nofollow noopener\" target=\"_blank\">moon landing moment<\/a>\u201d by industry.<\/p>\n<p class=\"\" data-block=\"sciam\/paragraph\">But are they? Is AI going to replace <a href=\"https:\/\/www.scientificamerican.com\/article\/inside-the-secret-meeting-where-mathematicians-struggled-to-outsmart-ai\/\" rel=\"nofollow noopener\" target=\"_blank\">professional mathematicians<\/a>? I\u2019m still waiting for the proof.<\/p>\n<p>On supporting science journalism<\/p>\n<p>If you&#8217;re enjoying this article, consider supporting our award-winning journalism by <a href=\"https:\/\/www.scientificamerican.com\/getsciam\/\" rel=\"nofollow noopener\" target=\"_blank\">subscribing<\/a>. By purchasing a subscription you are helping to ensure the future of impactful stories about the discoveries and ideas shaping our world today.<\/p>\n<p class=\"\" data-block=\"sciam\/paragraph\">The hype around this year\u2019s AI results is easy to understand because the olympiad is hard. To wit, in my senior year of high school, I set aside calculus and linear algebra to focus on olympiad-style problems, which were more of a challenge. Plus the cutting-edge models still in development did so much better at the exam than the commercial models already out there. 
In a parallel contest administered by MathArena.ai, Gemini 2.5 Pro, Grok 4, o3 high, o4-mini high and DeepSeek R1 all failed to produce a single completely correct solution (https://matharena.ai/imo/). That gap shows that AI models are getting smarter, their reasoning capabilities improving rather dramatically.

Yet I'm still not worried.

The latest models just got a good grade on a single test, as did many of the students, and a head-to-head comparison isn't entirely fair. The models often employ a "best-of-n" strategy, generating multiple candidate solutions and then grading themselves to select the strongest. This is akin to having several students work independently, then get together to pick the best solution and submit only that one. If the human contestants were allowed this option, their scores would likely improve too.

Other mathematicians are similarly cautioning against the hype. IMO gold medalist Terence Tao, currently a mathematician at the University of California, Los Angeles, noted on Mastodon (https://mathstodon.xyz/@tao/114881418225852441) that what AI can do depends on the testing methodology. IMO president Gregor Dolinar said that the organization "cannot validate the methods [used by the AI models], including the amount of compute used or whether there was any human involvement, or whether the results can be reproduced" (https://imo2025.au/wp-content/uploads/2025/07/IMO-2025_ClosingDayStatement-19072025.pdf).

Besides, IMO exam questions don't compare to the kinds of questions professional mathematicians try to answer, where it can take nine years, rather than nine hours, to solve a problem at the frontier of mathematical research. As Kevin Buzzard, a mathematics professor at Imperial College London, put it in an online forum, "When I arrived in Cambridge UK as an undergraduate clutching my IMO gold medal I was in no position to help any of the research mathematicians there."

These days it can take more than one lifetime to acquire the right expertise for mathematical research. Like many of my colleagues, I've been tempted to try "vibe proving": having a math chat with an LLM as one would with a colleague, asking "Is it true that..." followed by a technical mathematical conjecture. The chatbot often then supplies a clearly articulated argument that, in my experience, tends to be correct on standard topics but subtly wrong at the cutting edge.
For example, every model I've asked has made the same subtle mistake of assuming that the theory of idempotents behaves the same for weak infinite-dimensional categories as it does for ordinary ones, something that human experts in my field (trust me on this) know to be false.

I'll never trust an LLM, which at its core is just predicting what text will come next in a string of words based on what's in its dataset, to provide a mathematical proof that I can't verify myself.

The good news is that we do have an automated mechanism for determining whether proofs can be trusted. Relatively recent tools called "proof assistants" are software programs (they don't use AI) designed to check whether a logical argument proves the stated claim. They are increasingly attracting attention from mathematicians like Tao, Buzzard and me who want more assurance that our own proofs are correct. And they offer the potential to help democratize mathematics and even improve AI safety.

Suppose I received a letter, in unfamiliar handwriting, from Erode, a city in Tamil Nadu, India, purporting to contain a mathematical proof. Maybe its ideas are brilliant, or maybe they're nonsensical. I'd have to spend hours carefully studying every line, making sure the argument flowed step by step, before I could determine whether its conclusions were true or false.

But if the mathematical text were written in an appropriate computer syntax instead of natural language, a proof assistant could check the logic for me. A human mathematician like me would then need only to understand the meaning of the technical terms in the theorem statement. In the case of Srinivasa Ramanujan, a generational mathematical genius who did hail from Erode, an expert took the time to carefully decipher his letter. In 1913 Ramanujan wrote to the British mathematician G. H. Hardy with his ideas. Luckily, Hardy recognized Ramanujan's brilliance and invited him to Cambridge to collaborate, launching the career of one of the all-time mathematical greats.

What's interesting is that some of the AI IMO contestants submitted their answers in the language of the Lean computer proof assistant so that the program could automatically check their reasoning for errors. A start-up called Harmonic posted formal proofs generated by its model for five of the six problems, and ByteDance achieved a silver-medal-level performance by solving four of the six. But the questions had to be rewritten to accommodate the models' language limitations, and the models still needed days to work them out.

Still, formal proofs are uniquely trustworthy. While so-called reasoning models are prompted to break problems down into pieces and explain their "thinking" step by step, the output is as likely to be an argument that sounds logical but isn't as it is to constitute a genuine proof. By contrast, a proof assistant will not accept a proof unless it is fully precise and fully rigorous, justifying every step in its chain of reasoning.
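
To make this concrete, here is a toy sketch of what machine-checked mathematics looks like in Lean 4. The statements and names below are invented purely for illustration and have nothing to do with the olympiad submissions discussed above; only the library lemma Nat.add_comm is part of Lean itself.

```lean
-- A concrete arithmetic fact. Lean accepts `rfl` because both sides
-- compute to the same numeral; change the 4 to a 5 and the file
-- simply fails to check.
theorem two_plus_two : 2 + 2 = 4 := rfl

-- A claim about all natural numbers, closed by citing a library lemma
-- that the checker verifies applies exactly. A plausible-sounding but
-- invalid step would be rejected, not waved through.
theorem sum_swap (m n : Nat) : m + n = n + m :=
  Nat.add_comm m n
```

A proof written this way is only as meaningful as the statement being proved, so a human still has to read and understand the theorem, but the checking itself is no longer a matter of trust.
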
In some circumstances, a hand-waving or approximate solution is good enough. But when mathematical accuracy matters, we should demand that AI-generated proofs be formally verifiable.

Not every application of generative AI is so black and white, with humans who have the right expertise able to determine whether the results are correct or incorrect. In life there is a lot of uncertainty, and it's easy to make mistakes. As I learned in high school, one of the best things about math is that you can prove definitively that some ideas are wrong. So I'm happy to have an AI try to solve my personal math problems, but only if the results are formally verifiable. And we aren't quite there yet.

This is an opinion and analysis article, and the views expressed by the author or authors are not necessarily those of Scientific American.