{"id":192944,"date":"2025-09-30T19:05:06","date_gmt":"2025-09-30T19:05:06","guid":{"rendered":"https:\/\/www.newsbeep.com\/us\/192944\/"},"modified":"2025-09-30T19:05:06","modified_gmt":"2025-09-30T19:05:06","slug":"top-ai-models-are-getting-really-good-at-completing-professional-tasks-new-openai-gdpval-benchmark-shows","status":"publish","type":"post","link":"https:\/\/www.newsbeep.com\/us\/192944\/","title":{"rendered":"Top AI models are getting really good at completing professional tasks, new OpenAI GDPval benchmark shows"},"content":{"rendered":"<p>Hello and welcome to Eye on AI. In this edition\u2026A new OpenAI benchmark shows how good models are getting at completing professional tasks\u2026California has a new AI law\u2026OpenAI rolls out Instant Purchases in ChatGPT\u2026and AI can pick winning founders better than most VCs.<\/p>\n<p><a href=\"https:\/\/fortune.com\/company\/alphabet\/\" target=\"_blank\" aria-label=\"Go to https:\/\/fortune.com\/company\/alphabet\/\" class=\"sc-5ad7098d-0 lcJVdL\" rel=\"nofollow noopener\">Google<\/a> CEO Sundar Pichai was right <a href=\"https:\/\/www.businessinsider.com\/aji-artificial-jagged-intelligence-google-ceo-sundar-pichai-2025-6\" target=\"_blank\" rel=\"noopener nofollow\" aria-label=\"Go to https:\/\/www.businessinsider.com\/aji-artificial-jagged-intelligence-google-ceo-sundar-pichai-2025-6\" class=\"sc-5ad7098d-0 lcJVdL\">when he said<\/a> that while AI companies aspire to create AGI (artificial general intelligence), what we have right now is more like AJI\u2014artificial jagged intelligence. 
What Pichai meant by this is that today\u2019s AI is brilliant at some things, including some tasks that even human experts find difficult, while also performing poorly at some tasks that a human would find relatively easy.<\/p>\n<p>Thinking of AI in this way partly explains the confusing set of headlines we\u2019ve seen about AI lately\u2014<a href=\"https:\/\/www.newscientist.com\/article\/2489248-deepmind-and-openai-claim-gold-in-international-mathematical-olympiad\/\" target=\"_blank\" rel=\"noopener nofollow\" aria-label=\"Go to https:\/\/www.newscientist.com\/article\/2489248-deepmind-and-openai-claim-gold-in-international-mathematical-olympiad\/\" class=\"sc-5ad7098d-0 lcJVdL\">acing international math<\/a> and <a href=\"https:\/\/www.ft.com\/content\/c2f7e7ef-df7b-4b74-a899-1cb12d663ce6\" target=\"_blank\" rel=\"noopener nofollow\" aria-label=\"Go to https:\/\/www.ft.com\/content\/c2f7e7ef-df7b-4b74-a899-1cb12d663ce6\" class=\"sc-5ad7098d-0 lcJVdL\">coding competitions<\/a>, while many AI projects fail to achieve a return on investment and people complain about AI-created <a href=\"https:\/\/fortune.com\/2025\/09\/23\/ai-workslop-workshop-workplace-communication\/\" target=\"_self\" aria-label=\"Go to https:\/\/fortune.com\/2025\/09\/23\/ai-workslop-workshop-workplace-communication\/\" class=\"sc-5ad7098d-0 lcJVdL\" rel=\"nofollow noopener\">\u201cworkslop\u201d<\/a> being a drag on productivity. (More on some of these pessimistic studies later. Needless to say, there is often a lot less to these headlines than meets the eye.)<\/p>\n<p>One of the reasons for the seeming disparity in AI\u2019s capabilities is that many AI benchmarks do not reflect real world use cases. Which is why a new gauge <a href=\"https:\/\/openai.com\/index\/gdpval\/\" target=\"_blank\" rel=\"noopener nofollow\" aria-label=\"Go to https:\/\/openai.com\/index\/gdpval\/\" class=\"sc-5ad7098d-0 lcJVdL\">published by OpenAI<\/a> last week is so important. 
Called GDPval, the benchmark evaluates leading AI models on real-world tasks, curated by experts\u00a0from across 44 different professions, representing nine different sectors of the economy. The experts had an average of 14 years of experience in their fields, which ranged from law and finance to retail and manufacturing, as well as government and healthcare.\u00a0<\/p>\n<p>Whereas a traditional AI benchmark might test a model\u2019s capability to answer a multiple-choice bar exam question about contract law, for example, the GDPval assessment asks the AI model to craft an entire 3,500-word legal memo assessing the standard of review under Delaware law that a public company founder and CEO, with majority control, would face if he wanted this public company to acquire a private company that he also owned.<\/p>\n<p>OpenAI tested not only its own models, but those from a number of other leading labs, including Google DeepMind\u2019s Gemini 2.5 Pro, Anthropic\u2019s Claude Opus 4.1, and xAI\u2019s Grok 4. Of these, Claude Opus 4.1 consistently performed the best, beating or equaling human expert performance on 47.6% of the total tasks. (Big kudos to OpenAI for intellectual honesty in publishing a study in which its own models were not top of the heap.)<\/p>\n<p>There was a lot of variance between models, with Gemini and Grok often able to complete between a fifth and a third of tasks at or above the standard of human experts. OpenAI\u2019s GPT-5 Thinking fell between Claude Opus 4.1 and Gemini, while OpenAI\u2019s earlier model, GPT-4o, fared the worst of all, barely able to complete 10% of the tasks to a professional standard. GPT-5 was the best at following a prompt correctly, but often failed to format its response properly, according to the researchers. 
Gemini and Grok seemed to have the most problems with following instructions\u2014sometimes failing to provide the requested deliverable and ignoring reference data\u2014but OpenAI did note that \u201call the models sometimes hallucinated data or miscalculated.\u201d<\/p>\n<p>Big differences across sectors and professions<\/p>\n<p>There was also a bit of variance between economic sectors, with the models performing best on tasks from government, retail, and the wholesale trade, and generally worst on tasks from the manufacturing sector.<\/p>\n<p>For some professional tasks, Claude Opus 4.1\u2019s performance was off the charts: it beat or equalled human performance for 81% of the tasks taken from \u201ccounter and rental clerks,\u201d 76% of those taken from shipping clerks, 70% of those from software development, and, intriguingly, 70% of the tasks taken from the work of private investigators and detectives. (Forget Sherlock Holmes, just call Claude!) GPT-5 Thinking beat human experts on 79% of the tasks that sales managers perform and 75% of those that editors perform (gulp!).<\/p>\n<p>On others, human experts won handily. The models were all notably poor at performing tasks related to the work of film and video editors, producers and directors, and audio and video technicians. So Hollywood may be breathing a sigh of relief. The models also fell down on tasks related to pharmacists\u2019 jobs.<\/p>\n<p>When AI models failed to equal or exceed human performance, it was rarely in ways that human experts judged \u201ccatastrophic\u201d\u2014that occurred only about 2.7% of the time with GPT-5 failures. 
But the GPT-5 response was judged \u201cbad\u201d in another 26.7% of these cases, and \u201cacceptable but subpar\u201d in 47.7% of cases where human outputs were deemed superior.<\/p>\n<p>The need for \u2018Centaur\u2019 benchmarks<\/p>\n<p>I asked Erik Brynjolfsson, the Stanford University economist at the Human-Centered AI Institute (HAI) who has done some of the best research to date on the economic impact of generative AI, what he thought of <a href=\"https:\/\/cdn.openai.com\/pdf\/d5eb7428-c4e9-4a33-bd86-86dd4bcf12ce\/GDPval.pdf\" target=\"_blank\" rel=\"noopener nofollow\" aria-label=\"Go to https:\/\/cdn.openai.com\/pdf\/d5eb7428-c4e9-4a33-bd86-86dd4bcf12ce\/GDPval.pdf\" class=\"sc-5ad7098d-0 lcJVdL\">GDPval and the results<\/a>. He said the assessment goes a long way to closing the gap that has developed between AI researchers and their preferred benchmarks, which are often highly technical but don\u2019t match real-world problems. Brynjolfsson said he thought GDPval would \u201cinspire AI researchers to think more about how to design their systems to be useful in doing practical work, not just ace the technical benchmarks.\u201d He also said that \u201cin practice, that means integrating technology into workflows and more often than not, actively involving humans.\u201d<\/p>\n<p>Brynjolfsson said he and colleague Andy Haupt had been arguing for <a href=\"https:\/\/digitaleconomy.stanford.edu\/publications\/position-ai-should-not-be-an-imitation-game-centaur-evaluations\/\" target=\"_blank\" rel=\"noopener nofollow\" aria-label=\"Go to https:\/\/digitaleconomy.stanford.edu\/publications\/position-ai-should-not-be-an-imitation-game-centaur-evaluations\/\" class=\"sc-5ad7098d-0 lcJVdL\">\u201cCentaur Evaluations\u201d<\/a> which judge how well humans perform when paired with, and assisted by, an AI model, rather than always seeing the AI model as a replacement for human workers. 
(The term comes from the idea of \u201ccentaur chess,\u201d in which human grandmasters are assisted by chess computers. The pairing was found to exceed what either humans or machines could do alone. And, of course, centaur here refers to the half-man, half-horse creature of Greek mythology.)<\/p>\n<p>GDPval did take some steps toward doing this, looking in one case at how much time and money were saved when OpenAI\u2019s models were allowed to try a task multiple times, with the human then coming in to fix the output if it was not up to standard. Here, GPT-5 was found to offer both a 1.5x speedup and a 1.5x cost improvement over the human expert working without AI assistance. (Less capable OpenAI models did not help as much, with GPT-4o actually leading to a slowdown and cost increase over the human expert working unassisted.)<\/p>\n<p>About that AI workslop research\u2026<\/p>\n<p>This last point, along with the \u201cacceptable but subpar\u201d label that characterized a good portion of the cases where the AI models did not equal human performance, brings me back to that \u201cworkslop\u201d research that came out last week. This may, in fact, be what is happening with some AI outputs in corporate settings, especially as the most capable models\u2014such as GPT-5, Claude Opus 4.1, and Gemini 2.5 Pro\u2014are being used at scale by only a handful of companies. 
That said, as the journalist Adam Davidson pointed out in <a href=\"https:\/\/www.linkedin.com\/posts\/adamjdavidson_i-just-read-the-stanfordbetterup-study-on-activity-7377011127806967808-TAiW?utm_source=share&amp;utm_medium=member_desktop&amp;rcm=ACoAAA0ruw0BbuW0X2plb8fqaPFfU52oUkTA7Xo\" target=\"_blank\" rel=\"noopener nofollow\" aria-label=\"Go to https:\/\/www.linkedin.com\/posts\/adamjdavidson_i-just-read-the-stanfordbetterup-study-on-activity-7377011127806967808-TAiW?utm_source=share&amp;utm_medium=member_desktop&amp;rcm=ACoAAA0ruw0BbuW0X2plb8fqaPFfU52oUkTA7Xo\" class=\"sc-5ad7098d-0 lcJVdL\">a Linkedin post<\/a>, the \u201cWorkslop\u201d study\u2014just like that now infamous MIT study about 95% of AI pilots failing to produce ROI\u2014had some very serious flaws. The \u201cworkslop\u201d study was based on an open online survey that asked highly leading questions. It was essentially a \u201cpush poll\u201d designed to generate an attention-grabbing headline about the problem of AI workslop more than a piece of intellectually-honest research. But it worked\u2014it got lots of headlines, including in Fortune.<br \/>If one focuses on these kinds of headlines, it is all too easy to miss the other side of what is happening in AI, which is the story that GDPval tells: the best performing AI models are already on par with human expertise on many tasks. (And remember that GDPval has so far been tested only on Anthropic\u2019s Claude Opus 4.1, not its new Claude Sonnet 4.5 that was released yesterday and which can work continuously on a task for up to 30 hours, far longer than any previous model.) This doesn\u2019t mean AI can replace these professional experts any time soon. As Brynjolfsson\u2019s work has shown, most jobs consist of dozens of different tasks, and AI can only equal or beat human performance on some of them. 
In many cases, a human needs to be in the loop to correct the outputs when a model fails (which, as GDPval shows, is still happening at least 20% of the time, even on the professional tasks where the models perform best). But AI is making inroads, sometimes rapidly, in many domains\u2014and more and more of its outputs are not just workslop.<\/p>\n<p>With that, here\u2019s more AI news.<\/p>\n<p>Jeremy Kahn<br \/><a href=\"mailto:jeremy.kahn@fortune.com\" target=\"_blank\" rel=\"noopener nofollow\" aria-label=\"Go to mailto:jeremy.kahn@fortune.com\" class=\"sc-5ad7098d-0 lcJVdL\">jeremy.kahn@fortune.com<\/a><br \/>@jeremyakahn<\/p>\n<p>Before we get to the news, I want to call your attention to the Fortune AIQ 50, a new ranking that Fortune published today and that evaluates how Fortune 500 companies are doing in deploying AI. The ranking shows which companies, across 18 different sectors\u2014from financials to healthcare to retail\u2014are doing best when it comes to AI, as judged by both self-assessments and peer reviews. 
You can see the list <a href=\"https:\/\/fortune.com\/ranking\/aiq\/\" target=\"_self\" aria-label=\"Go to https:\/\/fortune.com\/ranking\/aiq\/\" class=\"sc-5ad7098d-0 lcJVdL\" rel=\"nofollow noopener\">here<\/a>, and catch up on Fortune\u2019s ongoing AIQ series.<\/p>\n<p>FORTUNE ON AI<\/p>\n<p><a href=\"https:\/\/fortune.com\/2025\/09\/29\/openai-rolls-out-purchases-direct-from-chatgpt-in-a-radical-shift-to-e-commerce-and-direct-challenge-to-google\/\" target=\"_self\" aria-label=\"Go to https:\/\/fortune.com\/2025\/09\/29\/openai-rolls-out-purchases-direct-from-chatgpt-in-a-radical-shift-to-e-commerce-and-direct-challenge-to-google\/\" class=\"sc-5ad7098d-0 lcJVdL\" rel=\"nofollow noopener\">OpenAI rolls out \u2018instant\u2019 purchases directly from ChatGPT, in a radical shift to e-commerce and a direct challenge to Google<\/a>\u2014by Jeremy Kahn<br \/><a href=\"https:\/\/fortune.com\/2025\/09\/29\/anthropic-releases-claude-sonnet-4-5-a-model-it-says-can-build-software-and-accomplish-business-tasks-autonomously\/\" target=\"_self\" aria-label=\"Go to https:\/\/fortune.com\/2025\/09\/29\/anthropic-releases-claude-sonnet-4-5-a-model-it-says-can-build-software-and-accomplish-business-tasks-autonomously\/\" class=\"sc-5ad7098d-0 lcJVdL\" rel=\"nofollow noopener\">Anthropic releases Claude Sonnet 4.5, a model it says can build software and accomplish business tasks autonomously<\/a>\u2014by Beatrice Nolan<br \/><a href=\"https:\/\/fortune.com\/2025\/09\/28\/nvidia-openai-circular-financing-ai-bubble\/?utm_source=search&amp;utm_medium=suggested_search&amp;utm_campaign=search_link_clicks\" target=\"_self\" aria-label=\"Go to https:\/\/fortune.com\/2025\/09\/28\/nvidia-openai-circular-financing-ai-bubble\/?utm_source=search&amp;utm_medium=suggested_search&amp;utm_campaign=search_link_clicks\" class=\"sc-5ad7098d-0 lcJVdL\" rel=\"nofollow noopener\">Nvidia\u2019s $100 billion OpenAI investment raises eyebrows and a key question: How much of the AI boom is just 
Nvidia\u2019s cash being recycled?<\/a>\u2014by Jeremy Kahn<\/p>\n<p><a href=\"https:\/\/fortune.com\/2025\/09\/29\/ford-ceo-jim-farley-blue-collar-worker-essential-economy-crisis-ai-data-centers\/\" target=\"_self\" aria-label=\"Go to https:\/\/fortune.com\/2025\/09\/29\/ford-ceo-jim-farley-blue-collar-worker-essential-economy-crisis-ai-data-centers\/\" class=\"sc-5ad7098d-0 lcJVdL\" rel=\"nofollow noopener\">Ford CEO warns there\u2019s a dearth of blue-collar workers able to construct AI data centers and operate factories: \u2018Nothing to backfill the ambition\u2019<\/a>\u2014by Sasha Rogelberg<\/p>\n<p>EYE ON AI NEWS<\/p>\n<p>Meta locks in $14 billion worth of AI compute. The tech giant struck a $14 billion multi-year deal with CoreWeave to secure access to Nvidia GPUs (including next-gen GB300 systems). It\u2019s another sign of Big Tech\u2019s arms race for AI capacity. The pact follows CoreWeave\u2019s recent expansion tied to OpenAI and sent CoreWeave shares up. Read more from Reuters <a href=\"https:\/\/www.reuters.com\/technology\/coreweave-signs-14-billion-ai-deal-with-meta-bloomberg-news-reports-2025-09-30\/\" target=\"_blank\" rel=\"noopener nofollow\" aria-label=\"Go to https:\/\/www.reuters.com\/technology\/coreweave-signs-14-billion-ai-deal-with-meta-bloomberg-news-reports-2025-09-30\/\" class=\"sc-5ad7098d-0 lcJVdL\">here<\/a>.<\/p>\n<p>California governor signs landmark AI law. Governor Gavin Newsom signed SB 53 into law on Monday. The new AI legislation requires developers of high-end AI systems to publicly disclose safety plans and report serious incidents. The law also adds whistleblower protections for employees of AI companies and a public \u201cCalCompute\u201d cloud to broaden research access to AI. Large labs must outline how they mitigate catastrophic risks, with penalties for non-compliance. 
The measure\u2014authored by State Senator Scott Wiener\u2014follows last year\u2019s veto of a stricter bill that was roundly opposed by Silicon Valley heavyweights and AI companies. This time, some AI companies, such as Anthropic, as well as Elon Musk, supported SB 53, while Meta, Google and OpenAI opposed it. Read more from Reuters <a href=\"https:\/\/www.reuters.com\/legal\/litigation\/californias-newsom-signs-law-requiring-ai-safety-disclosures-2025-09-29\/\" target=\"_blank\" rel=\"noopener nofollow\" aria-label=\"Go to https:\/\/www.reuters.com\/legal\/litigation\/californias-newsom-signs-law-requiring-ai-safety-disclosures-2025-09-29\/\" class=\"sc-5ad7098d-0 lcJVdL\">here<\/a>.\u00a0\u00a0<\/p>\n<p>OpenAI\u2019s revenue surges\u2014but its burn rate remains dramatic. The AI company generated about $4.3 billion in the first half of 2025\u2014up 16% on all of 2024, according to financial details it disclosed to its investors and which were <a href=\"https:\/\/www.theinformation.com\/articles\/openais-first-half-results-4-3-billion-sales-2-5-billion-cash-burn?rc=xf7ohf\" target=\"_blank\" rel=\"noopener nofollow\" aria-label=\"Go to https:\/\/www.theinformation.com\/articles\/openais-first-half-results-4-3-billion-sales-2-5-billion-cash-burn?rc=xf7ohf\" class=\"sc-5ad7098d-0 lcJVdL\">reported by The Information<\/a>. But the company still had a burn rate of $2.5 billion over that same time period due to aggressive spending on R&amp;D and AI infrastructure. The company said it is targeting about $13 billion in revenue for 2025, but with a total cash burn of $8.5 billion. OpenAI is in the middle of a secondary share sale that could value the company at $500 billion, almost double its valuation of $260 billion at the start of the year.<\/p>\n<p>Apple is testing a stronger, still-secret model for Apple Intelligence. 
That\u2019s according to a report from <a href=\"https:\/\/www.bloomberg.com\/news\/articles\/2025-09-26\/apple-builds-a-chatgpt-like-app-to-help-test-the-revamped-siri?utm_medium=email&amp;utm_source=author_alert&amp;utm_term=250926&amp;utm_campaign=author_19842959\" target=\"_blank\" rel=\"noopener nofollow\" aria-label=\"Go to https:\/\/www.bloomberg.com\/news\/articles\/2025-09-26\/apple-builds-a-chatgpt-like-app-to-help-test-the-revamped-siri?utm_medium=email&amp;utm_source=author_alert&amp;utm_term=250926&amp;utm_campaign=author_19842959\" class=\"sc-5ad7098d-0 lcJVdL\">Bloomberg<\/a>, which cited unnamed sources it said were familiar with the matter. The news agency said Apple is internally trialing a ChatGPT-style app powered by an upgraded AI model, with the aim of using it to overhaul its digital assistant, Siri. The new chatbot would be rolled out as part of upcoming Apple Intelligence updates, Bloomberg said.<br \/>Opera launches Neon, an \u201cagentic\u201d AI browser. In a further sign that AI has rekindled the browser wars, the browser company Opera rolled out Neon, a browser with built-in AI that can execute multi-step tasks (think booking travel or generating code) from natural-language prompts. Opera is charging a subscription for Neon. It joins Perplexity\u2019s Comet and Google\u2019s rollout of Gemini in Chrome in the increasingly competitive field of AI browsers. 
Read more from TechCrunch <a href=\"https:\/\/techcrunch.com\/2025\/09\/30\/opera-launches-its-ai-centric-neon-browser\/?utm_source=chatgpt.com\" target=\"_blank\" rel=\"noopener nofollow\" aria-label=\"Go to https:\/\/techcrunch.com\/2025\/09\/30\/opera-launches-its-ai-centric-neon-browser\/?utm_source=chatgpt.com\" class=\"sc-5ad7098d-0 lcJVdL\">here<\/a>.<br \/>Black Forest Labs in talks to raise $200 million to $300 million at $4 billion valuation.\u00a0That\u2019s according to a <a href=\"https:\/\/www.ft.com\/content\/299ddb72-1ae7-4691-bf6b-cd0229bbb04f?utm_source=chatgpt.com\" target=\"_blank\" rel=\"noopener nofollow\" aria-label=\"Go to https:\/\/www.ft.com\/content\/299ddb72-1ae7-4691-bf6b-cd0229bbb04f?utm_source=chatgpt.com\" class=\"sc-5ad7098d-0 lcJVdL\">story<\/a> in the Financial Times. It says the somewhat secretive German image-generation startup (makers of the Flux models and founded by ex-Stable Diffusion employees) is negotiating a new venture capital round that would value the company around $4 billion, up from roughly $1 billion last year. The round would mark one of Europe\u2019s largest recent AI financings and underscores investor appetite for next-generation visual models.\u00a0<\/p>\n<p>EYE ON AI RESEARCH<\/p>\n<p>Can an AI model beat VCs at spotting winning startups? Yes, it can, according to a new study conducted by researchers from the University of Oxford and AI startup Vela Research. They created a new assessment they call VCBench, built from 9,000 anonymized founder profiles, to evaluate whether LLMs can predict startup success better than human investors. (Of these 9,000 founders, 9% went on to see their companies either get acquired, raise more than $500 million in funding, or IPO at more than a $500 million valuation.) In their tests, some models far outperformed the record of venture capital firms, which in general pick a winner about one in every 20 bets they make. 
OpenAI\u2019s GPT-5 scored a winner about half the time, while DeepSeek-V3 was the most accurate, selecting winners six out of every 10 times, and doing so at a lower cost than most other models. Interestingly, a different machine learning technique from Vela, called reasoned rule mining, was more accurate still, hitting a winner 87.5% of the time. (The researchers also tried to ensure that the LLMs were not simply finding a clever way to re-identify the people whose anonymized profiles make up the dataset and cheat by simply looking up what had happened to their companies. The researchers say they were able to reduce this chance to the point where it was unlikely to be the case.) The researchers are publishing a public leaderboard at <a href=\"http:\/\/vcbench.com\" target=\"_blank\" rel=\"noopener nofollow\" aria-label=\"Go to http:\/\/vcbench.com\" class=\"sc-5ad7098d-0 lcJVdL\">vcbench.com<\/a>. You can read more about the research <a href=\"https:\/\/arxiv.org\/abs\/2509.14448?utm_source=substack&amp;utm_medium=email\" target=\"_blank\" rel=\"noopener nofollow\" aria-label=\"Go to https:\/\/arxiv.org\/abs\/2509.14448?utm_source=substack&amp;utm_medium=email\" class=\"sc-5ad7098d-0 lcJVdL\">here<\/a> on arxiv.org and in the Financial Times <a href=\"https:\/\/www.ft.com\/content\/7314c8de-98f3-4e65-83e6-b31a358bf4bc\" target=\"_blank\" rel=\"noopener nofollow\" aria-label=\"Go to https:\/\/www.ft.com\/content\/7314c8de-98f3-4e65-83e6-b31a358bf4bc\" class=\"sc-5ad7098d-0 lcJVdL\">here<\/a>.<\/p>\n<p>AI CALENDAR<\/p>\n<p>Oct. 6: OpenAI DevDay, San Francisco\n<\/p>\n<p>Oct. 6-10:\u00a0World\u00a0AI\u00a0Week, Amsterdam\n<\/p>\n<p>Oct. 21-22: TedAI San Francisco.\n<\/p>\n<p>Nov. 10-13:\u00a0Web Summit, Lisbon.\u00a0<\/p>\n<p>Nov. 26-27:\u00a0World AI Congress, London.\n<\/p>\n<p>Dec. 2-7:\u00a0NeurIPS, San Diego\n<\/p>\n<p>Dec. 8-9:\u00a0Fortune Brainstorm AI San Francisco. 
Apply to attend <a href=\"https:\/\/conferences.fortune.com\/event\/brainstorm-ai-2025\/HOME\" target=\"_self\" aria-label=\"Go to https:\/\/conferences.fortune.com\/event\/brainstorm-ai-2025\/HOME\" class=\"sc-5ad7098d-0 lcJVdL\" rel=\"nofollow noopener\">here<\/a>.\n<\/p>\n<p>BRAIN FOOD<\/p>\n<p>Are world models and reinforcement learning all we need? There was a big controversy among AI researchers and other industry insiders this past week over the appearance of Turing Award-winner and AI research legend Rich Sutton on<a href=\"https:\/\/www.dwarkesh.com\/p\/richard-sutton\" target=\"_blank\" rel=\"noopener nofollow\" aria-label=\"Go to https:\/\/www.dwarkesh.com\/p\/richard-sutton\" class=\"sc-5ad7098d-0 lcJVdL\"> the Dwarkesh podcast<\/a>. Sutton argued that LLMs are actually a dead end that will never achieve AGI because they can only ever imitate human knowledge and they don\u2019t construct a \u201cworld model\u201d\u2014a way of predicting what will happen next based on an intuitive understanding of things such as the laws of physics or, even, human nature. 
Dwarkesh pushed back, suggesting to Sutton that LLMs did, in fact, have a kind of world model, but Sutton was having none of it.<br \/>Some\u2014such as AI skeptic Gary Marcus\u2013<a href=\"https:\/\/garymarcus.substack.com\/p\/game-over-for-pure-llms-even-turing\" target=\"_blank\" rel=\"noopener nofollow\" aria-label=\"Go to https:\/\/garymarcus.substack.com\/p\/game-over-for-pure-llms-even-turing\" class=\"sc-5ad7098d-0 lcJVdL\">interpreted<\/a> what Sutton said on Dwarkesh as a major reversal from the position he had taken in a famous essay, \u201cThe Bitter Lesson,\u201d <a href=\"http:\/\/www.incompleteideas.net\/IncIdeas\/BitterLesson.html\" target=\"_blank\" rel=\"noopener nofollow\" aria-label=\"Go to http:\/\/www.incompleteideas.net\/IncIdeas\/BitterLesson.html\" class=\"sc-5ad7098d-0 lcJVdL\">published<\/a> in 2019, which argued that progress in AI mostly depended on using the same basic algorithms but simply throwing more compute and more data at them, rather than any clever algorithmic innovation. \u201cThe Bitter Lesson\u201d has been waved like a bloody flag by those who have argued that \u201cscale is all we need\u201d\u2014building ever bigger LLMs on ever larger GPU clusters\u2014to achieve AGI.<\/p>\n<p>But Sutton never wrote explicitly about LLMs in &#8220;The Bitter Lesson&#8221; and I don\u2019t think his Dwarkesh remarks should be interpreted as a departure from his position. Instead, Sutton has always been first and foremost an advocate of reinforcement learning in environments where the reward signal comes entirely from the environment, with an AI model acting agentically and acquiring experience\u2014building a model of \u201cthe rules of the game\u201d as well as the most rewarding actions in any given situation. 
Sutton doesn\u2019t like the way LLMs are trained, with unsupervised learning from human text followed by a kind of RL using human feedback\u2014because everything the LLM can learn is inherently limited by human knowledge and human preferences. He has always been an advocate for the idea of pure tabula rasa learning. To Sutton, LLMs are a big departure from tabula rasa, and so it is not surprising he sees them as a dead end to AGI.\u00a0<\/p>\n","protected":false},"excerpt":{"rendered":"Hello and welcome to Eye on AI. In this edition\u2026A new OpenAI benchmark shows how good models are&hellip;\n","protected":false},"author":2,"featured_media":192945,"comment_status":"","ping_status":"","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[45],"tags":[182,181,507,4320,3195,110,36133,1929,1283,74],"class_list":{"0":"post-192944","1":"post","2":"type-post","3":"status-publish","4":"format-standard","5":"has-post-thumbnail","7":"category-artificial-intelligence","8":"tag-ai","9":"tag-artificial-intelligence","10":"tag-artificialintelligence","11":"tag-chatbots","12":"tag-chatgpt","13":"tag-economics","14":"tag-eye-on-ai","15":"tag-labor","16":"tag-openai","17":"tag-technology"},"_links":{"self":[{"href":"https:\/\/www.newsbeep.com\/us\/wp-json\/wp\/v2\/posts\/192944","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.newsbeep.com\/us\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.newsbeep.com\/us\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.newsbeep.com\/us\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/www.newsbeep.com\/us\/wp-json\/wp\/v2\/comments?post=192944"}],"version-history":[{"count":0,"href":"https:\/\/www.newsbeep.com\/us\/wp-json\/wp\/v2\/posts\/192944\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.newsbeep.com\/us\/wp-json\/wp\/v2\/media\/192945"}],"wp:attachment":[{"href":"https:\/\/www.newsbee
p.com\/us\/wp-json\/wp\/v2\/media?parent=192944"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.newsbeep.com\/us\/wp-json\/wp\/v2\/categories?post=192944"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.newsbeep.com\/us\/wp-json\/wp\/v2\/tags?post=192944"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}