{"id":266520,"date":"2025-11-06T23:11:16","date_gmt":"2025-11-06T23:11:16","guid":{"rendered":"https:\/\/www.newsbeep.com\/ca\/266520\/"},"modified":"2025-11-06T23:11:16","modified_gmt":"2025-11-06T23:11:16","slug":"ai-capabilities-may-be-overhyped-on-bogus-benchmarks-study-finds","status":"publish","type":"post","link":"https:\/\/www.newsbeep.com\/ca\/266520\/","title":{"rendered":"AI Capabilities May Be Overhyped on Bogus Benchmarks, Study Finds"},"content":{"rendered":"<p>You know all of those reports about artificial intelligence models successfully <a href=\"https:\/\/law.stanford.edu\/2023\/04\/19\/gpt-4-passes-the-bar-exam-what-that-means-for-artificial-intelligence-tools-in-the-legal-industry\/\" rel=\"nofollow noopener\" target=\"_blank\">passing the bar<\/a> or <a href=\"https:\/\/futurism.com\/ceo-deepmind-openai-phd-ai\" rel=\"nofollow noopener\" target=\"_blank\">achieving Ph.D.-level intelligence<\/a>? Looks like we should start taking those degrees back. A <a href=\"https:\/\/openreview.net\/pdf?id=mdA5lVvNcU\" rel=\"nofollow noopener\" target=\"_blank\">new study from researchers at the Oxford Internet Institute<\/a> suggests that many of the popular benchmarking tools used to test AI performance are unreliable and misleading.<\/p>\n<p>The researchers looked at 445 different benchmark tests used across industry and academia to measure everything from <a href=\"https:\/\/arcprize.org\/arc-agi\/2\/\" rel=\"nofollow noopener\" target=\"_blank\">reasoning capabilities<\/a> to <a href=\"https:\/\/livebench.ai\/#\/\" rel=\"nofollow noopener\" target=\"_blank\">performance on coding tasks<\/a>. 
Experts reviewed each benchmarking approach and found indications that the results these tests produce may be less accurate than presented, due in part to vague definitions of what a benchmark is attempting to test and a lack of disclosed statistical methods that would allow different models to be reliably compared.<\/p>\n<p>A big problem the researchers <a href=\"https:\/\/oxrml.com\/measuring-what-matters\/\" rel=\"nofollow noopener\" target=\"_blank\">found<\/a> is that \u201cMany benchmarks are not valid measurements of their intended targets.\u201d That is to say, while a benchmark may claim to measure a specific skill, it may operationalize that skill in a way that doesn\u2019t actually capture a model\u2019s capability.<\/p>\n<p>For example, the researchers point to the Grade School Math 8K (GSM8K) benchmark, which measures a model\u2019s performance on grade school-level word-based math problems designed to push the model into \u201cmulti-step mathematical reasoning.\u201d GSM8K is <a href=\"https:\/\/arxiv.org\/pdf\/2110.14168\" rel=\"nofollow noopener\" target=\"_blank\">advertised<\/a> as being \u201cuseful for probing the informal reasoning ability of large language models.\u201d<\/p>\n<p>But the researchers argue that a high score doesn\u2019t necessarily tell you that a model is engaging in reasoning. \u201cWhen you ask a first grader what two plus five equals and they say seven, yes, that\u2019s the correct answer. But can you conclude from this that a fifth grader has mastered mathematical reasoning or arithmetic reasoning from just being able to add numbers? 
Perhaps, but I think the answer is very likely no,\u201d Adam Mahdi, a senior research fellow at the Oxford Internet Institute and a lead author of the study, <a href=\"https:\/\/www.nbcnews.com\/tech\/tech-news\/ai-chatgpt-test-smart-capabilities-may-exaggerated-flawed-study-rcna241969\" rel=\"nofollow noopener\" target=\"_blank\">told NBC News<\/a>.<\/p>\n<p>In the study, the researchers pointed out that GSM8K scores have increased over time, which may indicate that models are genuinely getting better at this kind of reasoning. But it may also point to contamination, which happens when benchmark questions leak into a model\u2019s training data and the model starts \u201cmemorizing\u201d answers rather than reasoning its way to a solution. When researchers retested the models on a new set of comparable benchmark questions, they observed \u201csignificant performance drops.\u201d<\/p>\n<p>While this study is among the largest reviews of AI benchmarking to date, it\u2019s not the first to suggest this system of measurement may not be all it\u2019s sold to be. 
Last year, <a href=\"https:\/\/hai.stanford.edu\/assets\/files\/hai-policy-brief-what-makes-a-good-ai-benchmark.pdf\" rel=\"nofollow noopener\" target=\"_blank\">researchers at Stanford<\/a> analyzed several popular AI model benchmark tests and found \u201clarge quality differences between them, including those widely relied on by developers and policymakers,\u201d and noted that most benchmarks \u201care highest quality at the design stage and lowest quality at the implementation stage.\u201d<\/p>\n<p>If nothing else, the research is a good reminder that these performance measures, while often well-intended and meant to provide an accurate analysis of a model, can be turned into little more than marketing speak for companies.<\/p>\n","protected":false},"excerpt":{"rendered":"You know all of those reports about artificial intelligence models successfully passing the bar or achieving Ph.D.-level intelligence?&hellip;\n","protected":false},"author":2,"featured_media":266521,"comment_status":"","ping_status":"","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[20],"tags":[62,2139,276,277,214,49,48,4120,61],"class_list":{"0":"post-266520","1":"post","2":"type-post","3":"status-publish","4":"format-standard","5":"has-post-thumbnail","7":"category-artificial-intelligence","8":"tag-ai","9":"tag-artificial-intellience","10":"tag-artificial-intelligence","11":"tag-artificialintelligence","12":"tag-benchmarks","13":"tag-ca","14":"tag-canada","15":"tag-emerging-technologies","16":"tag-technology"},"_links":{"self":[{"href":"https:\/\/www.newsbeep.com\/ca\/wp-json\/wp\/v2\/posts\/266520","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.newsbeep.com\/ca\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.newsbeep.com\/ca\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.newsbeep.com\/ca\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/www.newsbeep.com\/ca\/wp-json\/wp\/v2\/comments?post=266520"}],"version-history":[{"count":0,"href":"https:\/\/www.newsbeep.com\/ca\/wp-json\/wp\/v2\/posts\/266520\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.newsbeep.com\/ca\/wp-json\/wp\/v2\/media\/266521"}],"wp:attachment":[{"href":"https:\/\/www.newsbeep.com\/ca\/wp-json\/wp\/v2\/media?parent=266520"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.newsbeep.com\/ca\/wp-json\/wp\/v2\/categories?post=266520"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.newsbeep.com\/ca\/wp-json\/wp\/v2\/tags?post=266520"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}