{"id":274494,"date":"2025-11-06T05:07:08","date_gmt":"2025-11-06T05:07:08","guid":{"rendered":"https:\/\/www.newsbeep.com\/us\/274494\/"},"modified":"2025-11-06T05:07:08","modified_gmt":"2025-11-06T05:07:08","slug":"ais-capabilities-may-be-exaggerated-by-flawed-tests-study-says","status":"publish","type":"post","link":"https:\/\/www.newsbeep.com\/us\/274494\/","title":{"rendered":"AI\u2019s capabilities may be exaggerated by flawed tests, study says"},"content":{"rendered":"<p id=\"anchor-6412d1\" class=\"body-graf\">Researchers behind a new study say that the methods used to evaluate AI systems\u2019 capabilities routinely oversell AI performance and lack scientific rigor. <\/p>\n<p id=\"anchor-73694e\" class=\"body-graf\">The study, led by researchers at the Oxford Internet Institute in partnership with over three dozen researchers from other institutions, <a href=\"https:\/\/oxrml.com\/measuring-what-matters\/\" target=\"_blank\" rel=\"nofollow noopener\">examined<\/a> 445 leading AI tests, called benchmarks, often used to measure the performance of AI models across a variety of topic areas. <\/p>\n<p id=\"anchor-a9c240\" class=\"body-graf\">AI developers and researchers use these benchmarks to <a href=\"https:\/\/epoch.ai\/benchmarks\" target=\"_blank\" rel=\"nofollow noopener\">evaluate model abilities<\/a> and <a href=\"https:\/\/blog.google\/technology\/ai\/2024-ai-extraordinary-progress-advancement\/\" target=\"_blank\" rel=\"nofollow noopener\">tout technical progress<\/a>, referencing them to make claims on topics ranging from <a href=\"https:\/\/www.anthropic.com\/engineering\/swe-bench-sonnet\" target=\"_blank\" rel=\"nofollow noopener\">software engineering performance<\/a> to <a href=\"https:\/\/x.com\/arcprize\/status\/1943168950763950555?lang=en\" target=\"_blank\" rel=\"nofollow\">abstract-reasoning capacity<\/a>. However, the paper, released Tuesday, claims these fundamental tests might not be reliable and calls into question the validity of many benchmark results.  <\/p>\n<p id=\"anchor-209c55\" class=\"body-graf\">According to the study, a significant number of top-tier benchmarks fail to define what exactly they aim to test, concerningly reuse data and testing methods from pre-existing benchmarks, and seldom use reliable statistical methods to compare results between models.<\/p>\n<p id=\"anchor-51ca64\" class=\"body-graf\">Adam Mahdi, a senior research fellow at the Oxford Internet Institute and a lead author of the study, argued these benchmarks can be alarmingly misleading: \u201cWhen we ask AI models to perform certain tasks, we often actually measure completely different concepts or constructs than what we aim to measure,\u201d Mahdi told NBC News. <\/p>\n<p id=\"anchor-efc2e7\" class=\"body-graf\">Andrew Bean, a researcher at the Oxford Internet Institute and another lead author of the study, concurred that even reputable benchmarks are too often blindly trusted and deserve more scrutiny.<\/p>\n<p id=\"anchor-36ca05\" class=\"body-graf\">\u201cYou need to really take it with a grain of salt when you hear things like \u2018a model achieves Ph.D. level intelligence,\u2019\u201d Bean told NBC News. \u201cWe\u2019re not sure that those measurements are being done especially well.\u201d<\/p>\n<p id=\"anchor-9a287a\" class=\"body-graf\">Some of the benchmarks examined in the analysis measure specific skills, like Russian or Arabic language abilities, while other benchmarks measure more general capabilities, like spatial reasoning and continual learning. 
A core issue for the authors was whether a benchmark is a good test of the real-world phenomenon it aims to measure, which they label “construct validity.” Instead of testing a model on an endless series of questions to evaluate its ability to speak Russian, for example, one benchmark reviewed in the study measures a model’s performance on nine different tasks, like answering yes-or-no questions using information drawn from Russian-language Wikipedia.

However, roughly half of the benchmarks examined in the study fail to clearly define the concepts they purport to measure, casting doubt on their ability to yield useful information about the AI models being tested.

As an example, the authors highlight a common AI benchmark called Grade School Math 8K (GSM8K), which measures performance on a set of basic math questions. Observers often point to leaderboards on the GSM8K benchmark to show that AI models are highly capable at fundamental mathematical reasoning, and the benchmark’s documentation says it is “useful for probing the informal reasoning ability of large language models.”

Yet correct answers on benchmarks like GSM8K do not necessarily mean the model is actually engaging in mathematical reasoning, Mahdi said. “When you ask a first grader what two plus five equals and they say seven, yes, that’s the correct answer. But can you conclude from this that a first grader has mastered mathematical reasoning or arithmetic reasoning from just being able to add numbers? Perhaps, but I think the answer is very likely no.”

Bean acknowledged that measuring nebulous concepts like reasoning requires evaluating a subset of tasks, and that any such selection will inevitably be imperfect. “There are a lot of moving pieces in these evaluations, and satisfying all of them requires balance. But this paper calls for benchmarks to clearly define what they set out to measure,” he said.

“With concepts like harmlessness or reasoning, people oftentimes just throw the word around to pick something that falls near that category that they can measure and say, ‘Great, now I’ve measured it,’” Bean added.

In the new paper, the authors make eight recommendations and provide a checklist intended to systematize benchmark criteria and improve transparency and trust in benchmarks. The suggested improvements include specifying the scope of the ability being evaluated, constructing batteries of tasks that better represent the overall abilities being measured, and comparing models’ performance with statistical analysis.

Nikola Jurkovic, a member of technical staff at the influential METR AI research center, commended the paper’s contributions. “We need more rigor if we want to be able to interpret the results of AI benchmarks. This checklist is a starting point for researchers to check whether their benchmark will be insightful,” Jurkovic told NBC News.

Tuesday’s paper builds on previous research pointing out flaws in many AI benchmarks.

Last year, researchers from AI company Anthropic advocated for increased statistical testing to determine whether a model’s performance on a specific benchmark really reflected a difference in capabilities, or was instead just a lucky result given the particular tasks and questions included in the benchmark.
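To make that point concrete, here is a minimal sketch of the kind of significance check such proposals call for. It is not code from the study or from Anthropic’s post: the question count, the scores, and the use of a simple two-proportion confidence interval are all illustrative assumptions (a paired analysis over the shared question set, which Anthropic’s researchers discuss, would generally be more sensitive).

```python
import math

# Hypothetical benchmark results: two models graded on the same set of
# 1,000 questions. The counts below are made up for illustration.
n = 1000
correct_a = 870  # questions model A answered correctly
correct_b = 845  # questions model B answered correctly

p_a, p_b = correct_a / n, correct_b / n
gap = p_a - p_b

# Standard error of the difference between two proportions, treating the
# models' scores as independent samples (a simplification; paired methods
# exploit the shared question set and yield tighter intervals).
se = math.sqrt(p_a * (1 - p_a) / n + p_b * (1 - p_b) / n)

# 95% confidence interval for the accuracy gap.
low, high = gap - 1.96 * se, gap + 1.96 * se
print(f"accuracy gap: {gap:.3f}, 95% CI: [{low:.3f}, {high:.3f}]")
# Here the interval is roughly [-0.006, 0.056]: it includes zero, so a
# 2.5-point leaderboard gap could plausibly be luck in which questions
# were sampled rather than a real difference in capability.
```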
In an effort to make benchmarks more useful and accurate, several research groups have recently proposed new series of tests designed to better measure models’ real-world performance on economically meaningful tasks.

In late September, OpenAI released a new series of tests that evaluate AI’s performance on tasks required for 44 different occupations, in an attempt to better ground claims of AI capabilities in the real world. For example, the tests measure AI’s ability to fix inconsistencies in customer invoice Excel spreadsheets for an imaginary sales analyst role, or to create a full production schedule for a 60-second video shoot for an imaginary video producer.

Dan Hendrycks, director of the Center for AI Safety, and a team of researchers recently released a similar real-world benchmark designed to evaluate AI systems’ performance on a range of tasks necessary for the automation of remote work.

“It’s common for AI systems to score high on a benchmark but not actually solve the benchmark’s actual goal,” Hendrycks told NBC News.

Surveying the broader landscape of AI benchmarks, Mahdi said researchers and developers have many exciting avenues to explore. “We are just at the very beginning of the scientific evaluation of AI systems,” Mahdi said.
<\/p>\n","protected":false},"excerpt":{"rendered":"Researchers behind a new study say that the methods used to evaluate AI systems\u2019 capabilities routinely oversell AI&hellip;\n","protected":false},"author":2,"featured_media":274495,"comment_status":"","ping_status":"","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[45],"tags":[182,181,507,74],"class_list":{"0":"post-274494","1":"post","2":"type-post","3":"status-publish","4":"format-standard","5":"has-post-thumbnail","7":"category-artificial-intelligence","8":"tag-ai","9":"tag-artificial-intelligence","10":"tag-artificialintelligence","11":"tag-technology"},"_links":{"self":[{"href":"https:\/\/www.newsbeep.com\/us\/wp-json\/wp\/v2\/posts\/274494","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.newsbeep.com\/us\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.newsbeep.com\/us\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.newsbeep.com\/us\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/www.newsbeep.com\/us\/wp-json\/wp\/v2\/comments?post=274494"}],"version-history":[{"count":0,"href":"https:\/\/www.newsbeep.com\/us\/wp-json\/wp\/v2\/posts\/274494\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.newsbeep.com\/us\/wp-json\/wp\/v2\/media\/274495"}],"wp:attachment":[{"href":"https:\/\/www.newsbeep.com\/us\/wp-json\/wp\/v2\/media?parent=274494"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.newsbeep.com\/us\/wp-json\/wp\/v2\/categories?post=274494"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.newsbeep.com\/us\/wp-json\/wp\/v2\/tags?post=274494"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}