{"id":505517,"date":"2026-02-28T15:47:12","date_gmt":"2026-02-28T15:47:12","guid":{"rendered":"https:\/\/www.newsbeep.com\/ca\/505517\/"},"modified":"2026-02-28T15:47:12","modified_gmt":"2026-02-28T15:47:12","slug":"acing-this-new-ai-exam-which-its-creators-say-is-the-toughest-in-the-world-might-point-to-the-first-signs-of-agi","status":"publish","type":"post","link":"https:\/\/www.newsbeep.com\/ca\/505517\/","title":{"rendered":"Acing this new AI exam \u2014 which its creators say is the toughest in the world \u2014 might point to the first signs of AGI"},"content":{"rendered":"<p id=\"c3e162a2-64d5-4c7d-9983-d3bde4275a11\">Researchers at the Center for AI Safety and Scale AI have published &#8220;Humanity\u2019s Last Exam&#8221; \u2014 a test designed to measure how close today\u2019s most powerful <a data-analytics-id=\"inline-link\" href=\"https:\/\/www.livescience.com\/technology\/artificial-intelligence\/what-is-artificial-intelligence-ai\" data-url=\"https:\/\/www.livescience.com\/technology\/artificial-intelligence\/what-is-artificial-intelligence-ai\" data-hl-processed=\"none\" data-mrf-recirculation=\"inline-link\" data-before-rewrite-localise=\"https:\/\/www.livescience.com\/technology\/artificial-intelligence\/what-is-artificial-intelligence-ai\" rel=\"nofollow noopener\" target=\"_blank\">artificial intelligence<\/a> (AI) models are to meeting or exceeding human-level knowledge across several domains.<\/p>\n<p>The test was launched in January 2025, but scientists outlined the framework and their thinking behind its design for the first time in a new study published Jan. 
28 in the journal <a data-analytics-id=\"inline-link\" href=\"https:\/\/go.redirectingat.com?id=92X1590019&amp;xcust=livescience_us_1268127288894344619&amp;xs=1&amp;url=https%3A%2F%2Fwww.nature.com%2Farticles%2Fs41586-025-09962-4&amp;sref=https%3A%2F%2Fwww.livescience.com%2Ftechnology%2Fartificial-intelligence%2Facing-this-new-ai-exam-which-its-creators-say-is-the-toughest-in-the-world-might-point-to-the-first-signs-of-agi\" target=\"_blank\" data-url=\"https:\/\/www.nature.com\/articles\/s41586-025-09962-4\" referrerpolicy=\"no-referrer-when-downgrade\" rel=\"sponsored noopener nofollow\" data-hl-processed=\"skimlinks\" data-google-interstitial=\"false\" data-placeholder-url=\"https:\/\/go.redirectingat.com?id=92X1590019&amp;xcust=hawk-custom-tracking&amp;xs=1&amp;url=https%3A%2F%2Fwww.nature.com%2Farticles%2Fs41586-025-09962-4&amp;sref=https%3A%2F%2Fwww.livescience.com%2Ftechnology%2Fartificial-intelligence%2Facing-this-new-ai-exam-which-its-creators-say-is-the-toughest-in-the-world-might-point-to-the-first-signs-of-agi\" data-mrf-recirculation=\"inline-link\">Nature<\/a>. It contains a corpus of 2,500 questions across more than 100 subjects, with input from more than 1,000 subject-matter experts from 500 institutions across 50 countries.<\/p>\n<p id=\"c3e162a2-64d5-4c7d-9983-d3bde4275a11-2\" class=\"paywall\" aria-hidden=\"true\">The exam consists of multiple-choice and short-answer questions, each of which has a known solution that is &#8220;unambiguous and easily verifiable but cannot be quickly answered by internet retrieval.&#8221;<\/p>\n<p id=\"1089f0dc-7fc8-485d-b543-d4c51673bc31\">At launch, the researchers tested OpenAI\u2019s GPT-4o and o1 models, Google\u2019s Gemini 1.5 Pro, Anthropic\u2019s Claude 3.5 Sonnet and DeepSeek R1. 
OpenAI\u2019s o1 system notched the top spot with a score of just 8.3%.<\/p>\n<p>Despite this poor performance, the researchers wrote at the time that &#8220;given the rapid pace of AI development, it is plausible that models could exceed 50% accuracy on HLE by the end of 2025.&#8221;<\/p>\n<p>As of Feb. 12, 2026, the highest score <a data-analytics-id=\"inline-link\" href=\"https:\/\/blog.google\/innovation-and-ai\/models-and-research\/gemini-models\/gemini-3-deep-think\/\" target=\"_blank\" data-url=\"https:\/\/blog.google\/innovation-and-ai\/models-and-research\/gemini-models\/gemini-3-deep-think\/\" referrerpolicy=\"no-referrer-when-downgrade\" data-hl-processed=\"none\" data-mrf-recirculation=\"inline-link\" rel=\"nofollow noopener\">achieved so far<\/a> is 48.4%, set by Google\u2019s Gemini 3 Deep Think. Human experts, meanwhile, score around 90% in their respective domains.<\/p>\n<p><a id=\"elk-9f7556bd-d9a4-4f20-9472-1112a0ec04d6\" class=\"paywall\" aria-hidden=\"true\"\/>Testing the smartest machines in the world<\/p>\n<p id=\"5ba354a4-aa9c-4e2c-b9d7-c361582fdd82\">Humanity\u2019s Last Exam was intentionally designed to be extremely difficult for AI models. During early development, the researchers put out a global call for submissions from subject matter experts across numerous domains.<\/p>\n<p>The researchers enforced strict submission criteria requiring questions to be precise, unambiguous, solvable and non-searchable. They didn\u2019t want models to cheat by performing a simple web search, nor did they want any of the questions to already appear online \u2014 which would increase the likelihood that a given model had the answer in its training dataset.<\/p>\n<p>Each question submitted was then fed to the AI models. 
The team automatically rejected any questions the models could answer correctly.<\/p>\n<p>More than 70,000 submissions were attempted, resulting in approximately 13,000 questions that stumped LLMs. These were then vetted by a team of subject matter experts, approved by the research team, and presented to the scientific community for open feedback.<\/p>\n<p>Ultimately, the researchers narrowed the total submissions down to 2,500 questions that generally fall within the realm of PhD-level testing.<\/p>\n<p>An example of a trivia question in the exam is: \u201cIn Greek mythology, who was Jason\u2019s maternal great-grandfather?\u201d<\/p>\n<p>Meanwhile, an example of a physics question asks for the relationship between different forces during motion in a scenario where a block is placed on a horizontal rail (and can slide frictionlessly) while also being attached to a rigid, massless rod of an unknown length.<\/p>\n<p>The breadth of questions and scope of subjects covered by Humanity\u2019s Last Exam sets it apart from similar benchmarking tools, its creators say.<\/p>\n<p>Common tests, such as the <a data-analytics-id=\"inline-link\" href=\"https:\/\/artificialanalysis.ai\/evaluations\/mmlu-pro\" target=\"_blank\" data-url=\"https:\/\/artificialanalysis.ai\/evaluations\/mmlu-pro\" referrerpolicy=\"no-referrer-when-downgrade\" data-hl-processed=\"none\" data-mrf-recirculation=\"inline-link\" rel=\"nofollow noopener\">Massive Multitask Language Understanding<\/a> (MMLU) dataset, which was authored with participation from Center for AI Safety founder <a data-analytics-id=\"inline-link\" href=\"https:\/\/scholar.google.com\/citations?user=czyretsAAAAJ&amp;hl=en\" target=\"_blank\" data-url=\"https:\/\/scholar.google.com\/citations?user=czyretsAAAAJ&amp;hl=en\" referrerpolicy=\"no-referrer-when-downgrade\" data-hl-processed=\"none\" data-mrf-recirculation=\"inline-link\" rel=\"nofollow noopener\">Dan Hendrycks<\/a>, only test a small 
subset of expert-level domain knowledge, primarily focusing on coding and mathematics.<\/p>\n<p>Even state-of-the-art benchmarks such as Francois Chollet\u2019s <a data-analytics-id=\"inline-link\" href=\"https:\/\/arcprize.org\/arc-agi\/2\/\" target=\"_blank\" data-url=\"https:\/\/arcprize.org\/arc-agi\/2\/\" referrerpolicy=\"no-referrer-when-downgrade\" data-hl-processed=\"none\" data-mrf-recirculation=\"inline-link\" rel=\"nofollow noopener\">ARC-AGI<\/a> suite struggle to outpace the memorization and searchability problems that the creators of Humanity\u2019s Last Exam suggest the new test addresses. Gemini\u2019s Deep Think, for example, achieved 84.6% on the ARC-AGI-2 benchmark, just a week after failing to reach 50% on the HLE test.<\/p>\n<p><a id=\"elk-3f715b05-71b0-444e-b7c8-1b449f1f4f22\" class=\"paywall\" aria-hidden=\"true\"\/>The ultimate prize is general intelligence<\/p>\n<p id=\"ed4cdb25-55f5-431c-86e3-679f56f18c71\">Humanity\u2019s Last Exam likely represents the AI world\u2019s best attempt to date at measuring the broad-spectrum capabilities of modern AI models relative to human experts, but the study&#8217;s authors categorically state that achieving a high score on the HLE is in no way indicative of the arrival of <a data-analytics-id=\"inline-link\" href=\"https:\/\/www.livescience.com\/technology\/artificial-intelligence\/what-is-artificial-general-intelligence-agi\" data-url=\"https:\/\/www.livescience.com\/technology\/artificial-intelligence\/what-is-artificial-general-intelligence-agi\" data-hl-processed=\"none\" data-mrf-recirculation=\"inline-link\" data-before-rewrite-localise=\"https:\/\/www.livescience.com\/technology\/artificial-intelligence\/what-is-artificial-general-intelligence-agi\" rel=\"nofollow noopener\" target=\"_blank\">artificial general intelligence<\/a> (AGI).<\/p>\n<p>&#8220;High accuracy on HLE would demonstrate expert-level performance on closed-ended, verifiable questions and cutting-edge scientific knowledge, but 
it would not alone suggest autonomous research capabilities or artificial general intelligence,\u201d the scientists said in the study.<\/p>\n<p>&#8220;Doing well on HLE is a necessary, but not a sufficient criterion to say that machines have reached true intelligence,&#8221; <a data-analytics-id=\"inline-link\" href=\"https:\/\/schottdorflab.com\/team\/\" target=\"_blank\" data-url=\"https:\/\/schottdorflab.com\/team\/\" referrerpolicy=\"no-referrer-when-downgrade\" data-hl-processed=\"none\" data-mrf-recirculation=\"inline-link\" rel=\"nofollow noopener\">Manuel Schottdorf<\/a>, a neuroscientist at the University of Delaware\u2019s Department of Psychological and Brain Sciences, said in a <a data-analytics-id=\"inline-link\" href=\"https:\/\/www.udel.edu\/udaily\/2026\/february\/humanitys-last-exam-ai-benchmarking-manuel-schottdorf-cas\/\" target=\"_blank\" data-url=\"https:\/\/www.udel.edu\/udaily\/2026\/february\/humanitys-last-exam-ai-benchmarking-manuel-schottdorf-cas\/\" referrerpolicy=\"no-referrer-when-downgrade\" data-hl-processed=\"none\" data-mrf-recirculation=\"inline-link\" rel=\"nofollow noopener\">recent statement<\/a>. 
Schottdorf is one of the many experts whose question was accepted into the HLE\u2019s corpus.<\/p>\n<p>&#8220;They will have to be good enough to solve these questions, but that as a fact alone can&#8217;t allow us to conclude that machines are truly intelligent.&#8221;<\/p>\n","protected":false},"excerpt":{"rendered":"Researchers at the Center for AI Safety and Scale AI have published &#8220;Humanity\u2019s Last Exam&#8221; \u2014 a test&hellip;\n","protected":false},"author":2,"featured_media":505518,"comment_status":"","ping_status":"","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[20],"tags":[62,276,277,49,48,61],"class_list":{"0":"post-505517","1":"post","2":"type-post","3":"status-publish","4":"format-standard","5":"has-post-thumbnail","7":"category-artificial-intelligence","8":"tag-ai","9":"tag-artificial-intelligence","10":"tag-artificialintelligence","11":"tag-ca","12":"tag-canada","13":"tag-technology"},"_links":{"self":[{"href":"https:\/\/www.newsbeep.com\/ca\/wp-json\/wp\/v2\/posts\/505517","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.newsbeep.com\/ca\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.newsbeep.com\/ca\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.newsbeep.com\/ca\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/www.newsbeep.com\/ca\/wp-json\/wp\/v2\/comments?post=505517"}],"version-history":[{"count":0,"href":"https:\/\/www.newsbeep.com\/ca\/wp-json\/wp\/v2\/posts\/505517\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.newsbeep.com\/ca\/wp-json\/wp\/v2\/media\/505518"}],"wp:attachment":[{"href":"https:\/\/www.newsbeep.com\/ca\/wp-json\/wp\/v2\/media?parent=505517"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.newsbeep.com\/ca\/wp-json\/wp\/v2\/categories?post=505517"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.newsbeep.com\
/ca\/wp-json\/wp\/v2\/tags?post=505517"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}