Paul Tschisgale
Department of Physics Education, Leibniz Institute for Science and Mathematics Education, Kiel, Germany

August 13, 2025 • Physics 18, 147

As large language models improve, the real challenge is not how to shield education from AI, but how to embrace AI as a cornerstone of future physics learning and teaching.

Figure: If AI openly competed in a Physics Olympiad, it would likely win medals and disappoint its human adversaries. (P. Tschisgale/IPN; figure created using the GPT-4o image generator, OpenAI)

Since large language models (LLMs)—a prominent type of AI—became widely available to the public, their burgeoning capabilities have sparked both fascination and concern across many fields. As their proficiency in physics became more apparent, I began to wonder what this development might mean for settings where individual expertise is supposed to shine. In 2024, I completed my PhD on how students engage in high-level problem solving, particularly in the context of the German Physics Olympiad, a multiround competition in which highly motivated students work on challenging physics problems beyond the standard curriculum. From this perspective, a worry emerged: The Physics Olympiad is a setting where LLMs could be used quietly but effectively, raising difficult questions about whether the competition remains fair and whether its integrity can be upheld.

If AI models could solve Olympiad-level physics problems as well as, or even better than, the Olympiad participants themselves, the Olympiad would no longer reward deep understanding or genuine effort. Instead, it would risk rewarding those who relied on LLMs, regardless of their own level of expertise. To better understand the scope of the potential problem, my colleagues and I set out to test just how well contemporary LLMs perform on Olympiad-level physics problems. In our study, we used actual problems from the German Physics Olympiad to evaluate two advanced LLMs: GPT-4o, the previous default model behind ChatGPT, and o1-preview, a newer model optimized for reasoning [1].

Before conducting the study, I expected the LLMs to do reasonably well: previous studies had shown that LLMs could answer standard physics questions and solve problems at the high school or early university level. But I was taken aback by how well they performed on Olympiad-level problems, which are designed to challenge some of the best students in the country. GPT-4o outperformed the average human participant, and the newer o1-preview model did even better.

If LLMs can produce high-quality solutions on a par with or better than those of top students, then any observed performance in unsupervised settings—be it the homework rounds of a competition, class assignments, or online exams—may be suspect. This new reality challenges the validity of many current assessment formats and forces us to reconsider not only how we measure physics expertise but also what kinds of knowledge and abilities we want students to develop in the first place. How should physics education respond?

One possible response might be to ban the use of AI in educational settings and enforce the ban with detection tools. But this is unlikely to succeed: it would set up an ongoing arms race between increasingly sophisticated LLMs and the tools designed to detect their output, and the detection methods will almost always be one step behind, making it difficult to reliably distinguish between human- and AI-generated work. Another conceivable approach might be to rely more, in assessments, on physics problems that exploit current weaknesses of LLMs, for example problems that require interpreting diagrams. Yet this is a short-term fix at best, as such weaknesses may soon disappear. To ensure that what we evaluate still reflects students’ own thinking, we may need to rely more heavily on supervised formats, such as oral exams or in-person written assessments. These formats, however, would demand significantly more resources.

But instead of focusing solely on mitigating the risks of AI, shouldn’t we be asking another question? Why not let students use AI and focus on teaching them to do so thoughtfully and responsibly? AI is here to stay, and it will play a major role in many students’ academic and professional futures. We should equip students to work with AI tools such as LLMs, because the ability to use such powerful tools may soon be as important as mastering the subject matter itself.

As AI continues to improve, it may seem that we’re entering an era in which students no longer need to memorize formulas or solve complex equations by hand, because AI can do so faster and often better. Yet this view is too simplistic. AI models still make mistakes, just as humans do. These mistakes, however, are often hard to spot because the models present their responses in the polished language of experts. That’s why students still need a solid foundation in physics: to tell sound reasoning from superficial gloss.

What’s needed is a shift in educational priorities. This means not only teaching physics content but also helping students develop the ability to critically evaluate solutions—especially those generated by AI. In many ways, this mirrors how we already approach solving problems collaboratively. In such settings, students don’t always complete every step alone; they question, reflect, and build on shared inputs. Interacting with an LLM should be no different. The LLM may offer suggestions, but it’s the student’s responsibility to judge, refine, and, if needed, challenge those suggestions.

That kind of human–AI collaboration is something education should be working toward. In this vision, physics education remains grounded in teaching students conceptual knowledge and basic problem-solving strategies. But it places greater emphasis on critical thinking, reflective judgment, and the ability to engage productively with AI. Students still need a strong foundation in physics, but the way they apply their knowledge is evolving. Rather than competing with AI, they’ll collaborate with it, drawing on its strengths while compensating for its limitations. That’s the future we should be teaching toward.

References

[1] P. Tschisgale et al., “Evaluating GPT- and reasoning-based large language models on Physics Olympiad problems: Surpassing human performance and implications for educational assessment,” Phys. Rev. Phys. Educ. Res. 21, 020115 (2025).

About the Author

Paul Tschisgale is a postdoctoral researcher at the Leibniz Institute for Science and Mathematics Education in Kiel, Germany. He earned his PhD in physics education at Kiel University, Germany, in 2024. His research focuses on nurturing high-ability students and on using AI to improve physics learning, with an emphasis on the assessment and development of physics problem-solving abilities.
