{"id":258877,"date":"2026-01-23T00:20:07","date_gmt":"2026-01-23T00:20:07","guid":{"rendered":"https:\/\/www.newsbeep.com\/ie\/258877\/"},"modified":"2026-01-23T00:20:07","modified_gmt":"2026-01-23T00:20:07","slug":"are-ai-agents-ready-for-the-workplace-a-new-benchmark-raises-doubts","status":"publish","type":"post","link":"https:\/\/www.newsbeep.com\/ie\/258877\/","title":{"rendered":"Are AI agents ready for the workplace? A new benchmark raises doubts"},"content":{"rendered":"<p id=\"speakable-summary\" class=\"wp-block-paragraph\">It\u2019s been nearly two years since Microsoft CEO Satya Nadella predicted <a href=\"https:\/\/solmaz.io\/log\/2025\/02\/20\/satya-nadella-on-knowledge-work\/\" target=\"_blank\" rel=\"noreferrer noopener nofollow\">AI would replace knowledge work<\/a> \u2014 the white-collar jobs held by lawyers, investment bankers, librarians, accountants, IT, and others.<\/p>\n<p class=\"wp-block-paragraph\">But despite the huge progress made by foundation models, the change in knowledge work has been slow to arrive. Models have mastered in-depth research and agentic planning, but for whatever reason, most white-collar work has been relatively unaffected. <\/p>\n<p class=\"wp-block-paragraph\">It\u2019s one of the biggest mysteries in AI \u2014 and thanks to new research from the training-data giant Mercor, we\u2019re finally getting some answers.<\/p>\n<p class=\"wp-block-paragraph\">The new research looks at how leading AI models hold up doing actual white-collar work tasks, drawn from consulting, investment banking, and law. The result is a new benchmark called <a href=\"https:\/\/arxiv.org\/abs\/2601.14242\" target=\"_blank\" rel=\"noreferrer noopener nofollow\">APEX-Agents<\/a> \u2014 and so far, every AI lab is getting a failing grade. Faced with queries from real professionals, even the best models struggled to get more than a quarter of the questions right. 
The vast majority of the time, the model came back with a wrong answer or no answer at all.<\/p>\n<p class=\"wp-block-paragraph\">According to Mercor CEO Brendan Foody, who worked on the paper, the models\u2019 biggest stumbling point was tracking down information across multiple domains \u2014 something that\u2019s integral to most of the knowledge work performed by humans.<\/p>\n<p class=\"wp-block-paragraph\">\u201cOne of the big changes in this benchmark is that we built out the entire environment, modeled after real professional services,\u201d Foody told TechCrunch. \u201cThe way we do our jobs isn\u2019t with one individual giving us all the context in one place. In real life, you\u2019re operating across Slack and Google Drive and all these other tools.\u201d For many agentic AI models, that kind of multi-domain reasoning is still hit or miss.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" height=\"307\" width=\"680\" src=\"https:\/\/www.newsbeep.com\/ie\/wp-content\/uploads\/2026\/01\/Screen-Shot-2026-01-22-at-3.25.58-PM.jpg\" alt=\"\" class=\"wp-image-3085292\"  \/>Screenshot<\/p>\n<p class=\"wp-block-paragraph\">The scenarios were all drawn from actual professionals on Mercor\u2019s expert marketplace, who both laid out the queries and set the standard for a successful response. 
Looking through the questions, which are <a href=\"https:\/\/huggingface.co\/datasets\/mercor\/apex-agents\/blob\/main\/tasks_and_rubrics.json\" target=\"_blank\" rel=\"noreferrer noopener nofollow\">posted publicly on Hugging Face<\/a>, gives a sense of how complex the tasks can get.\u00a0<\/p>\n<p class=\"wp-block-paragraph\">One question in the \u201cLaw\u201d section reads:\u00a0<\/p>\n<p class=\"wp-block-paragraph\">During the first 48 minutes of the EU production outage, Northstar\u2019s engineering team exported one or two bundled sets of EU production event logs containing personal data to the U.S. analytics vendor\u00a0\u2026 Under Northstar\u2019s own policies, it can reasonably treat the one or two log exports as consistent with Article 49?<\/p>\n<p class=\"wp-block-paragraph\">The correct answer is yes, but getting there requires an in-depth assessment of the company\u2019s own policies as well as the relevant EU privacy laws.<\/p>\n<p class=\"wp-block-paragraph\">That might stump even a well-informed human, but the researchers were trying to model the work done by professionals in the field. If an LLM can reliably answer these questions, it could effectively replace many of the lawyers working today. \u201cI think this is probably the most important topic in the economy,\u201d Foody told TechCrunch. \u201cThe benchmark is very reflective of the real work that these people do.\u201d<\/p>\n<p class=\"wp-block-paragraph\">OpenAI also attempted to measure professional skills with <a href=\"https:\/\/openai.com\/index\/gdpval\/\" target=\"_blank\" rel=\"noreferrer noopener nofollow\">its GDPval benchmark<\/a> \u2014 but the APEX-Agents test differs in important ways. 
Where GDPval tests general knowledge across a wide range of professions, the APEX-Agents benchmark measures the system\u2019s ability to perform sustained tasks in a narrow set of high-value professions. The result is more difficult for models, but also more closely tied to whether these jobs can be automated.<\/p>\n<p class=\"wp-block-paragraph\">While none of the models proved ready to take over as investment bankers, some were clearly closer to the mark. Gemini 3 Flash performed the best of the group with 24% one-shot accuracy, followed closely by GPT-5.2 with 23%. Below that, Opus 4.5, Gemini 3 Pro and GPT-5 all scored roughly 18%.<\/p>\n<p class=\"wp-block-paragraph\">While the initial results fall short, the AI field has a history of blowing through challenging benchmarks. Now that the APEX-Agents test is public, it\u2019s an open challenge for AI labs that believe they can do better \u2014 something Foody fully expects in the months to come.\u00a0<\/p>\n<p class=\"wp-block-paragraph\">\u201cIt\u2019s improving really quickly,\u201d he told TechCrunch. \u201cRight now it\u2019s fair to say it\u2019s like an intern that gets it right a quarter of the time, but last year it was the intern that gets it right five or 10% of the time. 
That kind of improvement year after year can have an impact so quickly.\u201d<\/p>\n","protected":false},"excerpt":{"rendered":"It\u2019s been nearly two years since Microsoft CEO Satya Nadella predicted AI would replace knowledge work \u2014 the&hellip;\n","protected":false},"author":2,"featured_media":258878,"comment_status":"","ping_status":"","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[6],"tags":[12484,1302,61,16611,60,128555,6717,80],"class_list":{"0":"post-258877","1":"post","2":"type-post","3":"status-publish","4":"format-standard","5":"has-post-thumbnail","7":"category-technology","8":"tag-agentic-ai","9":"tag-exclusive","10":"tag-ie","11":"tag-investment-banking","12":"tag-ireland","13":"tag-knowledge-work","14":"tag-law","15":"tag-technology"},"_links":{"self":[{"href":"https:\/\/www.newsbeep.com\/ie\/wp-json\/wp\/v2\/posts\/258877","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.newsbeep.com\/ie\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.newsbeep.com\/ie\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.newsbeep.com\/ie\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/www.newsbeep.com\/ie\/wp-json\/wp\/v2\/comments?post=258877"}],"version-history":[{"count":0,"href":"https:\/\/www.newsbeep.com\/ie\/wp-json\/wp\/v2\/posts\/258877\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.newsbeep.com\/ie\/wp-json\/wp\/v2\/media\/258878"}],"wp:attachment":[{"href":"https:\/\/www.newsbeep.com\/ie\/wp-json\/wp\/v2\/media?parent=258877"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.newsbeep.com\/ie\/wp-json\/wp\/v2\/categories?post=258877"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.newsbeep.com\/ie\/wp-json\/wp\/v2\/tags?post=258877"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}