{"id":354726,"date":"2026-03-30T05:47:13","date_gmt":"2026-03-30T05:47:13","guid":{"rendered":"https:\/\/www.newsbeep.com\/nz\/354726\/"},"modified":"2026-03-30T05:47:13","modified_gmt":"2026-03-30T05:47:13","slug":"ai-just-a-year-from-beating-all-experts-in-humanitys-last-exam","status":"publish","type":"post","link":"https:\/\/www.newsbeep.com\/nz\/354726\/","title":{"rendered":"AI just a year from beating all experts in \u2018Humanity\u2019s Last Exam\u2019"},"content":{"rendered":"<p>Not so long ago, the peak of artificial intelligence was a helpful, animated paperclip. <\/p>\n<p>But tech developers have claimed that large language models could outperform all human experts within a year \u2014 and they have built the test to prove it.\u00a0<\/p>\n<p>Humanity\u2019s Last Exam (HLE) is a bank of 2,500 closed-format questions designed to see how far AI is from the \u201cfrontiers of human expertise\u201d.\u00a0<\/p>\n<p><img loading=\"lazy\" decoding=\"async\"   height=\"2592\" width=\"4608\" src=\"https:\/\/www.newsbeep.com\/nz\/wp-content\/uploads\/2026\/03\/12ec3de9-1b53-48bc-b819-64096f35b776.jpg\" alt=\"A person using the ChatGPT interface to write documents or homework.\" class=\"wp-image-21205021\"\/>Two years ago, ChatGPT was able to answer fewer than 3 per cent of questions in the exam correctly (Alamy)<\/p>\n<p>Covering about a hundred topics, which range from mythology and physiology to rocket science, the questions were designed to require at least PhD-level understanding.\u00a0Anyone \u2014 or anything \u2014 achieving even close to 100 per cent would be a \u201cuniversal expert\u201d. 
<\/p>\n<p>AI is racing towards that point at a startling pace.\u00a0<\/p>\n<p>\u201cWe wanted to create this close-ended academic benchmark, set to the frontier of expert humans, that only a handful of people on earth can really solve,\u201d said Calvin Zhang, the research lead at Scale, the AI company behind HLE.<\/p>\n<p>\u201cWe\u2019ve seen over the past few years insane progress on these language models. It\u2019s impressive, model builders have really done a great job at improving these reasoning models.\u201d<\/p>\n<p>During initial tests in late 2024, OpenAI\u2019s GPT-4o answered fewer than 3 per cent of HLE questions correctly. This proved that \u201ca marked gap\u201d remained between large language models (LLMs) and expert academics, researchers said.\u00a0<\/p>\n<p>Within months, <a href=\"https:\/\/www.thetimes.com\/us\/news-today\/article\/google-gemini-ai-jonathan-gavalas-lawsuit-7525rnk6t\" rel=\"nofollow noopener\" target=\"_blank\">Google Gemini<\/a> could answer 18.8 per cent of the exam correctly. Last month, its latest model scored 45.9 per cent.<\/p>\n<p>Researchers at Scale and the Center for AI Safety, a non-profit organisation, built HLE to test both breadth of knowledge and depth of reasoning \u2014 examining, for example, if a model could both identify and apply the correct mathematical formula.<\/p>\n<p>The creators issued a global appeal for submissions in September 2024, asking for short-response questions with unambiguous answers that could not easily be searched for on the web.<\/p>\n<p>Encouraged by a $500,000 prize pot, experts from about 50 countries submitted 70,000 questions. If any existing model could answer a question, it was rejected, allowing submissions to be whittled down to a set of 13,000 questions. 
In the end, about 2,500 questions were selected, although some have since been removed or edited based on user feedback.\u00a0<\/p>\n<p>The topics are deliberately eclectic: any successful LLM would have to know both the correct pronunciation of biblical Hebrew and the number of tendons attached to each bone of a hummingbird.<\/p>\n<p>A significant number of HLE questions have been kept secret to prevent \u201cbenchmark hacking\u201d, in which later models benefit from answers that have been discussed publicly online.\u00a0<\/p>\n<p>Despite the difficulty of the questions and the safeguards against cheating, AI could get close to full marks within a year, Zhang believes. <\/p>\n<p>Kate Olszewska, a product manager at <a href=\"https:\/\/www.thetimes.com\/uk\/technology-uk\/article\/deepmind-hails-kasparov-moment-as-ai-beats-best-human-coders-pbbbm8g96\" rel=\"nofollow noopener\" target=\"_blank\">Google DeepMind<\/a>, agrees: \u201cIf we truly cared about this as the only thing in life, I think we could get to it pretty quickly.\u201d<\/p>\n<p><img loading=\"lazy\" decoding=\"async\"   height=\"2797\" width=\"4191\" src=\"https:\/\/www.newsbeep.com\/nz\/wp-content\/uploads\/2026\/03\/2a3bc5f0-5823-48fa-a33e-996bb8935e41.jpg\" alt=\"A Google DeepMind sign visible through the window of a building, showing the reception area with a sofa, desk, and plants.\" class=\"wp-image-21205026\"\/>The London headquarters of Google DeepMind, the lab behind Gemini (Jose Sarmento Matos\/Bloomberg)<\/p>\n<p>This would not be the first time AI has scored highly in an \u201cimpossible\u201d test. It was presumed that chess was too complicated for a machine until an IBM computer, Deep Blue, <a href=\"https:\/\/www.thetimes.com\/culture\/tv-radio\/article\/rematch-review-disney-chess-garry-kasparov-00pdvwn8c\" rel=\"nofollow noopener\" target=\"_blank\">defeated Garry Kasparov<\/a>, the world champion, in 1997. 
A similar shock came in the Asian board game Go in 2016.\u00a0<\/p>\n<p>Other AI benchmarks have struggled to keep up in the last few years. Massive Multitask Language Understanding (MMLU), a multi-disciplinary, multiple-choice test, was released in 2020 but has already been largely discarded because the latest models can correctly answer up to 90 per cent of its questions.<\/p>\n<p>HLE is fundamentally different, its authors claim, because it is \u201cdesigned to be the final closed-ended academic benchmark of its kind\u201d. <\/p>\n<p>Once AI does crack this exam, AI models will have to be tested on questions that no human knows the answer to.<\/p>\n<p>\u201cIncreasingly, the frontier a lot of people are interested in is novel problem discovery and solution,\u201d Olszewska said. \u201cIt\u2019s not just being able to solve grad school questions and tests, it\u2019s about being able to assist researchers and scientists to actually do the discovery of new drugs or whatever else.\u201d\u00a0<\/p>\n<p>Even the exam\u2019s staunchest advocates recognise a \u201cuniversal expert\u201d bot could not easily replace a human specialist. A short-form written exam cannot capture every skill: physical tasks such as surgery fall outside it, as do creativity and judgement. <\/p>\n<p>\u201cThere\u2019s a part of it which is taste, right? You want taste and you want the diversity of humans to always be there,\u201d said Zhang.<\/p>\n","protected":false},"excerpt":{"rendered":"Not so long ago, the peak of artificial intelligence was a helpful, animated paperclip. 
But tech developers have&hellip;\n","protected":false},"author":2,"featured_media":354727,"comment_status":"","ping_status":"","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[20],"tags":[365,363,364,111,139,69,145],"class_list":{"0":"post-354726","1":"post","2":"type-post","3":"status-publish","4":"format-standard","5":"has-post-thumbnail","7":"category-artificial-intelligence","8":"tag-ai","9":"tag-artificial-intelligence","10":"tag-artificialintelligence","11":"tag-new-zealand","12":"tag-newzealand","13":"tag-nz","14":"tag-technology"},"_links":{"self":[{"href":"https:\/\/www.newsbeep.com\/nz\/wp-json\/wp\/v2\/posts\/354726","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.newsbeep.com\/nz\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.newsbeep.com\/nz\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.newsbeep.com\/nz\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/www.newsbeep.com\/nz\/wp-json\/wp\/v2\/comments?post=354726"}],"version-history":[{"count":0,"href":"https:\/\/www.newsbeep.com\/nz\/wp-json\/wp\/v2\/posts\/354726\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.newsbeep.com\/nz\/wp-json\/wp\/v2\/media\/354727"}],"wp:attachment":[{"href":"https:\/\/www.newsbeep.com\/nz\/wp-json\/wp\/v2\/media?parent=354726"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.newsbeep.com\/nz\/wp-json\/wp\/v2\/categories?post=354726"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.newsbeep.com\/nz\/wp-json\/wp\/v2\/tags?post=354726"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}