{"id":446364,"date":"2026-01-30T01:59:07","date_gmt":"2026-01-30T01:59:07","guid":{"rendered":"https:\/\/www.newsbeep.com\/au\/446364\/"},"modified":"2026-01-30T01:59:07","modified_gmt":"2026-01-30T01:59:07","slug":"ai-is-failing-humanitys-last-exam-so-what-does-that-mean-for-machine-intelligence","status":"publish","type":"post","link":"https:\/\/www.newsbeep.com\/au\/446364\/","title":{"rendered":"AI is failing \u2018Humanity\u2019s Last Exam\u2019. So what does that mean for machine intelligence?"},"content":{"rendered":"<p>How do you translate ancient Palmyrene script from a Roman tombstone? How many paired tendons are supported by a specific sesamoid bone in a hummingbird? Can you identify closed syllables in Biblical Hebrew based on the latest scholarship on Tiberian pronunciation traditions?<\/p>\n<p>These are some of the questions in \u201cHumanity\u2019s Last Exam\u201d, a new benchmark introduced in a <a href=\"https:\/\/doi.org\/10.1038\/s41586-025-09962-4\" rel=\"nofollow noopener\" target=\"_blank\">study<\/a> published this week in Nature. The collection of 2,500 questions is specifically designed to probe the outer limits of what today\u2019s artificial intelligence (AI) systems cannot do.<\/p>\n<p>The benchmark represents a global collaboration of nearly 1,000 experts across a range of academic fields. These academics and researchers contributed questions at the frontier of human knowledge. The problems required graduate-level expertise in mathematics, physics, chemistry, biology, computer science and the humanities. Importantly, every question was tested against leading AI models before inclusion. If an AI could answer it correctly at the time the test was designed, the question was rejected.<\/p>\n<p>This process explains why the initial results looked so different from other benchmarks. 
While AI chatbots score above 90% on <a href=\"https:\/\/arxiv.org\/abs\/2009.03300\" rel=\"nofollow noopener\" target=\"_blank\">popular tests<\/a>, when Humanity\u2019s Last Exam was first released in early 2025, leading models struggled badly. GPT-4o managed just 2.7% accuracy. Claude 3.5 Sonnet scored 4.1%. Even OpenAI\u2019s most powerful model, o1, achieved only 8%.<\/p>\n<p>The low scores were the point. The benchmark was constructed to measure what remained beyond AI\u2019s grasp. And while some <a href=\"https:\/\/uk.finance.yahoo.com\/news\/google-ai-posts-top-marks-115919456.html\" rel=\"nofollow noopener\" target=\"_blank\">commentators have suggested<\/a> that benchmarks like Humanity\u2019s Last Exam chart a path toward artificial general intelligence, or even superintelligence \u2013 that is, AI systems capable of performing any task at human or superhuman levels \u2013 we believe this is wrong for three reasons. <\/p>\n<p>Benchmarks measure task performance, not intelligence<\/p>\n<p>When a student scores well on the bar exam, we can reasonably predict they\u2019ll make a competent lawyer. That\u2019s because the test was designed to assess whether humans have acquired the knowledge and reasoning skills needed for legal practice \u2013 and for humans, that works. The understanding required to pass genuinely transfers to the job.<\/p>\n<p>But AI systems are not humans preparing for careers. <\/p>\n<p>When a large language model scores well on the bar exam, it tells us the model can produce correct-looking answers to legal questions. It doesn\u2019t tell us the model understands law, can counsel a nervous client, or exercise professional judgment in ambiguous situations. <\/p>\n<p>The test measures something real for humans; for AI it measures only performance on the test itself.<\/p>\n<p>Using human ability tests to benchmark AI is common practice, but it\u2019s fundamentally misleading. 
Assuming a high test score means the machine has become more human-like is a category error, much like concluding that a calculator \u201cunderstands\u201d mathematics because it can solve equations faster than any person.<\/p>\n<p>Human and machine intelligence are fundamentally different<\/p>\n<p>Humans learn continuously from experience. We have intentions, needs and goals. We live lives, inhabit bodies and experience the world directly. Our intelligence evolved to serve our survival as organisms and our success as social creatures. <\/p>\n<p>But AI systems <a href=\"https:\/\/scholarspace.manoa.hawaii.edu\/items\/e1abcaaf-c75f-48f7-849d-769c399d7698\" rel=\"nofollow noopener\" target=\"_blank\">are very different<\/a>.<\/p>\n<p>Large language models derive their capabilities from patterns in text during training. But <a href=\"https:\/\/theconversation.com\/ai-doesnt-really-learn-and-knowing-why-will-help-you-use-it-more-responsibly-250923\" rel=\"nofollow noopener\" target=\"_blank\">they don\u2019t really learn<\/a>. <\/p>\n<p>For humans, intelligence comes first and language serves as a tool for communication \u2013 <a href=\"https:\/\/doi.org\/10.1038\/s41586-024-07522-w\" rel=\"nofollow noopener\" target=\"_blank\">intelligence is prelinguistic<\/a>. 
But for large language models, language is the intelligence \u2013 there\u2019s nothing underneath.<\/p>\n<p>Even the creators of Humanity\u2019s Last Exam <a href=\"https:\/\/doi.org\/10.1038\/s41586-025-09962-4\" rel=\"nofollow noopener\" target=\"_blank\">acknowledge<\/a> this limitation:<\/p>\n<blockquote><p>High accuracy on [Humanity\u2019s Last Exam] would demonstrate expert-level performance on closed-ended, verifiable questions and cutting-edge scientific knowledge, but it would not alone suggest autonomous research capabilities or artificial general intelligence.<\/p><\/blockquote>\n<p>Subbarao Kambhampati, professor at Arizona State University and former president of the Association for the Advancement of Artificial Intelligence, <a href=\"https:\/\/the-decoder.com\/frontier-models-fail-hard-at-humanitys-last-exam-but-experts-question-if-it-matters\/\" rel=\"nofollow noopener\" target=\"_blank\">puts it more bluntly<\/a>:<\/p>\n<blockquote><p>Humanity\u2019s essence isn\u2019t captured by a static test but rather by our ability to evolve and tackle previously unimaginable questions.<\/p><\/blockquote>\n<p>Developers like leaderboards<\/p>\n<p>There\u2019s another problem. AI developers use benchmarks to optimise their models for leaderboard performance. They\u2019re essentially cramming for the exam. And unlike humans, for whom studying for a test builds understanding, AI optimisation just means getting better at the specific test.<\/p>\n<p>But it\u2019s working. <\/p>\n<p>Since Humanity\u2019s Last Exam was published online in early 2025, <a href=\"https:\/\/lastexam.ai\/\" rel=\"nofollow noopener\" target=\"_blank\">scores have climbed dramatically<\/a>. Gemini 3 Pro Preview now tops the leaderboard at 38.3% accuracy, followed by GPT-5 at 25.3% and Grok 4 at 24.5%.<\/p>\n<p>Does this improvement mean these models are approaching human intelligence? No. It means they\u2019ve gotten better at the kinds of questions the exam contains. 
The benchmark has become a target to optimise against.<\/p>\n<p>The industry is recognising this problem. <\/p>\n<p><a href=\"https:\/\/openai.com\/index\/gdpval\/\" rel=\"nofollow noopener\" target=\"_blank\">OpenAI recently introduced a measure called GDPval<\/a> specifically designed to assess real-world usefulness. <\/p>\n<p>Unlike academic-style benchmarks, GDPval focuses on tasks based on actual work products such as project documents, data analyses and deliverables that exist in professional settings.<\/p>\n<p>What this means for you<\/p>\n<p>If you\u2019re using AI tools in your work or considering adopting them, don\u2019t be swayed by benchmark scores. A model that aces Humanity\u2019s Last Exam might still struggle with the specific tasks you need done.<\/p>\n<p>It\u2019s also worth noting the exam\u2019s questions are heavily skewed toward certain domains. Mathematics alone accounts for 41% of the benchmark, with physics, biology and computer science making up much of the rest. If your work involves writing, communication, project management or customer service, the exam tells you almost nothing about which model might serve you best.<\/p>\n<p>A practical approach is to devise your own tests based on what you actually need AI to do, then evaluate newer models against criteria that matter to you. AI systems are genuinely useful \u2013 but any discussion about superintelligence remains science fiction and a distraction from the real work of making these tools relevant to people\u2019s lives.<\/p>\n","protected":false},"excerpt":{"rendered":"How do you translate ancient Palmyrene script from a Roman tombstone? 
How many paired tendons are supported by&hellip;\n","protected":false},"author":2,"featured_media":446365,"comment_status":"","ping_status":"","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[20],"tags":[256,254,255,64,63,105],"class_list":{"0":"post-446364","1":"post","2":"type-post","3":"status-publish","4":"format-standard","5":"has-post-thumbnail","7":"category-artificial-intelligence","8":"tag-ai","9":"tag-artificial-intelligence","10":"tag-artificialintelligence","11":"tag-au","12":"tag-australia","13":"tag-technology"},"_links":{"self":[{"href":"https:\/\/www.newsbeep.com\/au\/wp-json\/wp\/v2\/posts\/446364","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.newsbeep.com\/au\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.newsbeep.com\/au\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.newsbeep.com\/au\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/www.newsbeep.com\/au\/wp-json\/wp\/v2\/comments?post=446364"}],"version-history":[{"count":0,"href":"https:\/\/www.newsbeep.com\/au\/wp-json\/wp\/v2\/posts\/446364\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.newsbeep.com\/au\/wp-json\/wp\/v2\/media\/446365"}],"wp:attachment":[{"href":"https:\/\/www.newsbeep.com\/au\/wp-json\/wp\/v2\/media?parent=446364"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.newsbeep.com\/au\/wp-json\/wp\/v2\/categories?post=446364"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.newsbeep.com\/au\/wp-json\/wp\/v2\/tags?post=446364"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}