Not so long ago, the peak of artificial intelligence was a helpful, animated paperclip.
But tech developers have claimed that large language models could outperform all human experts within a year — and they have built the test to prove it.
Humanity’s Last Exam (HLE) is a bank of 2,500 closed-format questions designed to see how far AI is from the “frontiers of human expertise”.
Two years ago, ChatGPT was able to answer fewer than 3 per cent of questions in the exam correctly
Covering about a hundred topics, which range from mythology and physiology to rocket science, the questions were designed to require at least PhD-level understanding. Anyone — or anything — achieving even close to 100 per cent would be a “universal expert”.
AI is racing towards that point at a startling pace.
“We wanted to create this close-ended academic benchmark, set to the frontier of expert humans, that only a handful of people on earth can really solve,” said Calvin Zhang, the research lead at Scale, the AI company behind HLE.
“We’ve seen over the past few years insane progress on these language models. It’s impressive; model builders have really done a great job at improving these reasoning models.”
During initial tests in late 2024, OpenAI’s GPT-4o answered fewer than 3 per cent of HLE questions correctly. This proved that “a marked gap” remained between large language models (LLMs) and expert academics, researchers said.
Within months, Google Gemini could answer 18.8 per cent of the exam correctly. Last month, its latest model scored 45.9 per cent.
Researchers at Scale and the Center for AI Safety, a non-profit organisation, built HLE to test both breadth of knowledge and depth of reasoning: examining, for example, whether a model could both identify and apply the correct mathematical formula.
The creators issued a global appeal for submissions in September 2024, asking for short-response questions with unambiguous answers that could not easily be searched for on the web.
Encouraged by a $500,000 prize pot, experts from about 50 countries submitted 70,000 questions. If any existing model could answer a question, it was rejected, allowing submissions to be whittled down to a set of 13,000 questions. In the end, about 2,500 questions were selected, although some have since been removed or edited based on user feedback.
The topics are deliberately eclectic: any successful LLM would have to know both the correct pronunciation of biblical Hebrew and the number of tendons attached to each bone of a hummingbird.
A significant number of HLE questions have been kept secret to prevent later models from benefiting from answers being discussed publicly online, a practice known as “benchmark hacking”.
Despite the difficulty of the questions and the safeguards against cheating, AI could get close to full marks within a year, Zhang believes.
Kate Olszewska, a product manager at Google DeepMind, agrees: “If we truly cared about this as the only thing in life, I think we could get to it pretty quickly.”
The London headquarters of Google DeepMind, the lab behind Gemini
This would not be the first time AI has scored highly in an “impossible” test. It was presumed that chess was too complicated for a machine until an IBM computer, Deep Blue, defeated Garry Kasparov, the world champion, in 1997. A similar shock came in the Asian board game Go in 2016.
Other AI benchmarks have struggled to keep up in recent years. Massive Multitask Language Understanding (MMLU), a multidisciplinary, multiple-choice test released in 2020, has already been largely discarded because the latest models can correctly answer up to 90 per cent of its questions.
HLE is fundamentally different, its authors claim, because it is “designed to be the final closed-ended academic benchmark of its kind”.
Once AI does crack this exam, models will have to be tested on questions that no human knows the answer to.
“Increasingly, the frontier a lot of people are interested in is novel problem discovery and solution,” Olszewska said. “It’s not just being able to solve grad school questions and tests, it’s about being able to assist researchers and scientists to actually do the discovery of new drugs or whatever else.”
Even the exam’s staunchest advocates recognise that a “universal expert” bot could not easily replace a human specialist. Not every skill can be captured by a short-form written exam: physical tasks such as surgery fall outside its scope, not to mention creativity and judgement.
“There’s a part of it which is taste, right? You want taste and you want the diversity of humans to always be there,” said Zhang.