As artificial intelligence has advanced over the years, the methods used to measure its capabilities have become outdated. Tests that once challenged advanced AI models are now being solved with ease, making it harder for researchers to pinpoint what current systems are actually capable of.
In response, an international team of researchers has developed a new exam designed to test the limits of modern AI systems. Known as Humanity’s Last Exam (HLE), the assessment includes 2,500 expert-level questions spanning disciplines from mathematics and natural sciences to ancient languages and humanities. Details of the project and its results appear in a recent study published in Nature.
Initial results indicate that even the most advanced AI models struggled with the exam: GPT-4o scored 2.7% accuracy, Claude 3.5 Sonnet 4.1%, and OpenAI’s o1 about 8%. More recent systems, such as Gemini 3.1 Pro and Claude Opus 4.6, have improved to roughly 40-50% accuracy.
When AI Outgrows Tests
For years, researchers have used standardized tests to track AI capabilities. One well-known example is the Massive Multitask Language Understanding (MMLU) exam, which tests models in many academic subjects.
Today, many advanced AI systems perform well on these exams, prompting questions about whether these tests still provide meaningful insights into the true capabilities of artificial intelligence.
“When AI systems start performing extremely well on human benchmarks, it’s tempting to think they’re approaching human-level understanding,” said Dr. Tung Nguyen, an instructional associate professor of computer science and engineering at Texas A&M University and a contributor to the new benchmark. “But HLE reminds us that intelligence isn’t just about pattern recognition — it’s about depth, context and specialized expertise.”
An Exam Beyond AI’s Reach
The development of Humanity’s Last Exam involved nearly 1,000 researchers from diverse academic backgrounds. Each expert contributed questions that required advanced knowledge in their field, with every problem designed to have a single, verifiable answer.
The questions reflect a vast range of human knowledge, from translating ancient Palmyrene inscriptions to identifying microscopic anatomical features in birds and analyzing phonological details in Biblical Hebrew pronunciation. The team tested each question against leading AI models. If a model could correctly answer a question, it was excluded from the final set. This process resulted in an exam specifically designed to be outside the reach of current AI technology.
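To make that filtering step concrete, here is a minimal sketch of the idea in Python. The ask_model helper, the model list, and the simple question-and-answer record format are hypothetical stand-ins for illustration; the study’s actual pipeline was more involved and included additional steps such as expert review of candidate questions.

```python
# Minimal sketch of the adversarial filtering described above: a candidate
# question survives only if none of the frontier models answers it correctly.
# `ask_model`, the model names, and the record format are hypothetical
# stand-ins, not the study's actual code.

def ask_model(model_name: str, question: str) -> str:
    """Return the model's answer to a question (placeholder for a real API call)."""
    raise NotImplementedError("connect to a real model API here")

def filter_questions(candidates: list[dict], models: list[str]) -> list[dict]:
    """Keep only the questions that every listed model gets wrong."""
    survivors = []
    for item in candidates:
        solved_by_any = any(
            ask_model(m, item["question"]).strip().lower()
            == item["answer"].strip().lower()
            for m in models
        )
        if not solved_by_any:  # no model answered correctly, so keep it
            survivors.append(item)
    return survivors

# Usage sketch (illustrative only):
# pool = [{"question": "Translate this Palmyrene inscription ...", "answer": "..."}]
# final_set = filter_questions(pool, ["model-a", "model-b"])
```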
The Limits of Artificial Intelligence
Even the best AI models missed many questions during initial trials of the exam, especially those demanding deep, field-specific expertise. The results suggest that while AI is strong at pattern recognition, it still struggles with advanced, specialized knowledge.
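For context on how scores like those reported above can be computed, here is a minimal sketch of exact-match accuracy, one common way to score benchmarks whose questions have a single verifiable answer. The record format is a hypothetical stand-in, and published benchmarks often layer additional grading methods on top of this basic idea.

```python
# Minimal sketch of exact-match accuracy: the fraction of questions whose
# predicted answer matches the answer key. The IDs and answers below are
# illustrative placeholders, not actual exam content.

def exact_match_accuracy(predictions: dict[str, str],
                         answer_key: dict[str, str]) -> float:
    """Fraction of questions whose predicted answer matches the key."""
    correct = sum(
        1 for qid, gold in answer_key.items()
        if predictions.get(qid, "").strip().lower() == gold.strip().lower()
    )
    return correct / len(answer_key)

# Example: two of three answers match, so accuracy is 66.7%.
key = {"q1": "42", "q2": "Palmyra", "q3": "ossicle"}
preds = {"q1": "42", "q2": "palmyra", "q3": "tarsus"}
print(f"{exact_match_accuracy(preds, key):.1%}")  # 66.7%
```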
“Without accurate assessment tools, policymakers, developers and users risk misinterpreting what AI systems can actually do,” Nguyen said.
Although the name suggests a dramatic scenario, researchers created Humanity’s Last Exam to better understand the strengths and weaknesses of current AI systems, not to warn about AI replacing human expertise.
“This isn’t a race against AI,” Nguyen said. “It’s a method for understanding where these systems are strong and where they struggle.”
Designing a Benchmark for the Future
The team of specialists from various academic disciplines worked together to create a reliable way to test both computational ability and the wide range of knowledge that characterizes human expertise. Humanity’s Last Exam represents one of the most ambitious attempts to benchmark the capabilities of advanced AI systems. Researchers have kept most of the questions private to preserve the integrity of the exam as AI systems continue to improve.
“What made this project extraordinary was the scale,” Nguyen said. “Experts from nearly every discipline contributed. It wasn’t just computer scientists; it was historians, physicists, linguists, medical researchers. That diversity is exactly what exposes the gaps in today’s AI systems — perhaps ironically, it’s humans working together.”
Austin Burgess is a writer and researcher with a background in sales, marketing, and data analytics. He holds a Master of Business Administration, a Bachelor of Science in Business Administration, and a Data Analytics certification. His work combines analytical training with a focus on emerging science, aerospace, and astronomical research.