{"id":384605,"date":"2026-01-22T18:00:11","date_gmt":"2026-01-22T18:00:11","guid":{"rendered":"https:\/\/www.newsbeep.com\/uk\/384605\/"},"modified":"2026-01-22T18:00:11","modified_gmt":"2026-01-22T18:00:11","slug":"eight-in-ten-of-worlds-biggest-news-websites-now-block-ai-training-bots","status":"publish","type":"post","link":"https:\/\/www.newsbeep.com\/uk\/384605\/","title":{"rendered":"Eight in ten of world&#8217;s biggest news websites now block AI training bots"},"content":{"rendered":"<p>        <img width=\"1038\" height=\"778\" src=\"https:\/\/www.newsbeep.com\/uk\/wp-content\/uploads\/2026\/01\/shutterstock_2680036993-e1769012550733-1038x778.webp.webp\" class=\"attachment-4x3-large-crop size-4x3-large-crop wp-post-image\" alt=\"Robot with big red cross in front of it and a sad look on its face, to represent AI bots being blocked by publishers\" decoding=\"async\" fetchpriority=\"high\"  \/><br \/>\n                Picture: Shutterstock\/Ketmut<\/p>\n<p>Almost eight in ten of the biggest news websites in the UK and US are now blocking <a href=\"https:\/\/pressgazette.co.uk\/subject\/artificial-intelligence\/\" target=\"_blank\" rel=\"noopener nofollow\">AI<\/a> training bots.<\/p>\n<p>This week in London publishers were told it is \u201cnever too late\u201d to begin blocking the bots due to the frequency with which new crawling needs to take place via retrieval-augmented generation so that AI chatbot responses remain current.<\/p>\n<p>Some 79% of almost 100 top news websites in the UK and US are blocking at least one crawler used for AI training out of OpenAI\u2019s GPTBot, ClaudeBot, Anthropic-ai, CCBot, Applebot-Extended and Google-Extended.<\/p>\n<p>Meanwhile 71% are blocking AI bots from crawling their sites for retrieval or live searches. 
Those bots are: ChatGPT-User, Claude-Web, Perplexity-User and OAI-SearchBot.<\/p>\n<p>The latest analysis of who was blocking what was carried out by digital PR platform Buzzstream, looking at a combined and deduplicated list of the 50 biggest news websites in each of <a href=\"https:\/\/pressgazette.co.uk\/media-audience-and-business-data\/media_metrics\/most-popular-websites-news-uk-monthly-2\/\" target=\"_blank\" rel=\"noopener nofollow\">the UK<\/a> <a href=\"https:\/\/pressgazette.co.uk\/media-audience-and-business-data\/media_metrics\/most-popular-websites-news-us-monthly-3\/\" target=\"_blank\" rel=\"noopener nofollow\">and US<\/a>, and shared with Press Gazette.<\/p>\n<p>Among the 50 biggest, the publishers that had blocked all the AI bots included in the analysis were: the BBC (both its .<a href=\"https:\/\/co.uk\" rel=\"nofollow noopener\" target=\"_blank\">co.uk<\/a> and .com domains), The New York Times, the Daily Mail, The Telegraph, Sky News, AP News, New York Post, Newsweek, NBC News, Wall Street Journal, Metro, Business Insider, ABC News, Buzzfeed, Huffpost and The Hill. 
That means 34% of the top 50 blocked all the bots.<\/p>\n<p>Some 14% of the top-50 publishers allowed access to all 11 AI crawlers analysed: Fox News, The Independent, GB News, Substack, the Standard, the Drudge Report and Politico.<\/p>\n<p>Telegraph SEO director Harry Clarkson-Bennett told Buzzstream: \u201cPublishers are blocking AI bots using the robots.txt because there\u2019s <a href=\"https:\/\/pressgazette.co.uk\/publishers\/digital-journalism\/ai-bots-bombard-publisher-websites-with-no-meaningful-value-exchange\/\" target=\"_blank\" rel=\"noopener nofollow\">almost no value exchange.<\/a><\/p>\n<p>\u201cLLMs are not designed to send referral traffic and publishers (still) need traffic to survive.<\/p>\n<p>\u201cSo most of us block AI bots because these companies are not willing to pay for the content their model has been trained on and their output is almost entirely internal.\u201d<\/p>\n<p>Across the 50 biggest sites analysed, Perplexity-User was the most permitted (by 32) followed by ChatGPT-User (24). Both are retrieval\/live search crawlers that determine whether Perplexity or ChatGPT can access a site in real time to provide answers to user questions. 
Conversely, Perplexity-Bot surfaces and links to websites within search results on Perplexity.<\/p>\n<p>The least permitted was Anthropic\u2019s training bot (eight) followed by the Common Crawl web archive bot CCBot (nine).<\/p>\n<p>In February 2024 Press Gazette analysis showed that 61 out of the 106 biggest news websites in the UK and US <a href=\"https:\/\/pressgazette.co.uk\/platforms\/news-sites-block-ai-web-crawlers-chatgpt-google\/\" target=\"_blank\" rel=\"noopener nofollow\">were blocking at least one AI crawler<\/a>, showing an increase in the past two years.<\/p>\n<p>Google Extended \u201callows you to say yes to search, no to AI training\u201d, <a href=\"https:\/\/pressgazette.co.uk\/platforms\/google-ai-training-pay-opt-out-publishers\/\" target=\"_blank\" rel=\"noopener nofollow\">Google\u2019s senior manager for government affairs and public policy Roxanne Carter told UK Parliament last week.<\/a><\/p>\n<p>Google Extended lets publishers opt out from AI chatbot Gemini and AI development platform Vertex scraping their content, but does not stop sites from being accessed and used in Google\u2019s AI Overviews.<\/p>\n<p>To avoid this publishers would have to opt out of being scraped by Googlebot, which indexes for search. 
Googlebot was not included in the new Buzzstream analysis but opting out of it <a href=\"https:\/\/pressgazette.co.uk\/media_law\/google-ai-scraping-crawlers-financial-times-news-publishers\/\" target=\"_blank\" rel=\"noopener nofollow\">has been described as an \u201cunenviable choice\u201d.<\/a><\/p>\n<p>Buzzstream found that US publishers are \u201cmuch more aggressive\u201d in blocking Google-Extended (58%) than UK publishers (29%).<\/p>\n<p>Among the almost 100 top news sites in the US and UK, American sites are stricter on training bots (81% blocking) than their UK counterparts (77%).<\/p>\n<p>Robots.txt is not a foolproof way of blocking AI bots as its directives can be ignored by AI companies or circumvented in ways such as <a href=\"https:\/\/pressgazette.co.uk\/platforms\/third-party-scrapers-are-stealing-publisher-content-to-order-for-ai-companies\/\" target=\"_blank\" rel=\"noopener nofollow\">using third-party companies to scrape content instead.<\/a><\/p>\n<p>Typos and misconfigurations may stop publishers blocking effectively<\/p>\n<p>Anthony Katsur, CEO of IAB Tech Lab, told the Interactive Advertising Bureau (IAB)\u2019s Techtonic event in London on Wednesday that they had \u201cscoured a large percentage\u201d of publisher commands to AI bots and found \u201cthere\u2019s a lot of typos or a lot of misconfigurations of robots.txt, which means that may not be obeyed \u2013 not that it\u2019s always obeyed anyway\u201d.<\/p>\n<p>Katsur recommended that publishers block AI bots, even if they have not yet done so, because of retrieval-augmented generation, in which generative AI models retrieve and reference new information from the web in real time (rather than relying on one initial discovery crawl).<\/p>\n<p>Katsur said: \u201cPublishers, if you have not blocked and you think that it is too late to block, you are wrong. 
It is never too late to block.<\/p>\n<p>\u201cThe LLMs will come back and they will recrawl content in order for that information to stay fresh and relevant and accurate.\u201d<\/p>\n<p>He also said: \u201cIn the history of mankind, no marketplace has ever been created when things are given away for free or just stolen. So without scarcity, marketplaces won\u2019t exist. So publishers, if you are not blocking, then it is unlikely that some sort of content marketplace will take shape.\u201d<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"800\" height=\"583\" src=\"https:\/\/www.newsbeep.com\/uk\/wp-content\/uploads\/2026\/01\/iab_techtonic_2026_021-800x583.webp.webp\" alt=\"Anthony Katsur, CEO, IAB Tech Lab speaking on stage with close-up words on presentation behind him\" class=\"wp-image-247968\"  \/>Anthony Katsur, CEO, IAB Tech Lab speaking at IAB UK Techtonic in London on Wednesday 21 January 2026. Picture: Ben Stevens Photography\/IAB UK<\/p>\n<p>Katsur suggested that publishers should collectively \u201clink arms\u201d and block the bots \u201cfull stop\u201d, even if only for 72 hours, \u201cjust to show that there\u2019s strength amongst publishers, big and small\u201d.<\/p>\n<p>He said that if publishers were threatened with an antitrust or collusion lawsuit as a result, \u201cthat would be a good problem to have because it would mean you\u2019re still in business\u201d.<\/p>\n<p>Katsur said the IAB Tech Lab believes the \u201cdays of crawling\u201d are not \u201csustainable for publishers\u201d.<\/p>\n<p>\u201cWe also don\u2019t necessarily think it\u2019s sustainable for LLMs\u2026 as we see the AI bubble burst, I think from a fiscal or a funding perspective, easy access to ongoing purchase of GPUs [graphics processing units] to fund the LLMs will become constrained, therefore you will need greater efficiencies in terms of how content is accessed and structured for more efficient LLMs.\u201d<\/p>\n<p>The IAB Tech Lab is <a 
href=\"https:\/\/iabtechlab.com\/press-releases\/iab-tech-lab-forms-ai-content-monetization-protocols-comp-working-group-to-set-ai-era-publisher-monetization-standards\/\" target=\"_blank\" rel=\"noopener nofollow\">working on an alternative Content Monetisation Protocols (CoMP) initiative<\/a> that, Katsur said, would involve a \u201cnew set of open source APIs that would enable LLMs to access publisher or brand content only under strict controls and structures, thereby protecting your intellectual property, your copyright, and again, also providing LLMs with a well structured format for more efficient consumption, use and accuracy\u201d.<\/p>\n<p>A CoMP working group held in October featured Google, Microsoft and Meta taking a seat at the table alongside publishers. But Katsur said they have \u201chad absolute crickets from OpenAI, Anthropic, Perplexity and some of the other smaller ones\u201d.<\/p>\n","protected":false},"excerpt":{"rendered":"Picture: Shutterstock\/Ketmut Almost eight in ten of the biggest news websites in the UK and US are 
now&hellip;\n","protected":false},"author":2,"featured_media":384606,"comment_status":"","ping_status":"","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[20],"tags":[554,304,733,4308,243,1921,47370,22669,844,1996,2712,86,141749,56,54,55],"class_list":{"0":"post-384605","1":"post","2":"type-post","3":"status-publish","4":"format-standard","5":"has-post-thumbnail","7":"category-artificial-intelligence","8":"tag-ai","9":"tag-apple","10":"tag-artificial-intelligence","11":"tag-artificialintelligence","12":"tag-bbc","13":"tag-chatgpt","14":"tag-daily-mail","15":"tag-fox-news","16":"tag-google","17":"tag-new-york-times","18":"tag-pwfree","19":"tag-technology","20":"tag-telegraph-media-group","21":"tag-uk","22":"tag-united-kingdom","23":"tag-unitedkingdom"},"_links":{"self":[{"href":"https:\/\/www.newsbeep.com\/uk\/wp-json\/wp\/v2\/posts\/384605","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.newsbeep.com\/uk\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.newsbeep.com\/uk\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.newsbeep.com\/uk\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/www.newsbeep.com\/uk\/wp-json\/wp\/v2\/comments?post=384605"}],"version-history":[{"count":0,"href":"https:\/\/www.newsbeep.com\/uk\/wp-json\/wp\/v2\/posts\/384605\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.newsbeep.com\/uk\/wp-json\/wp\/v2\/media\/384606"}],"wp:attachment":[{"href":"https:\/\/www.newsbeep.com\/uk\/wp-json\/wp\/v2\/media?parent=384605"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.newsbeep.com\/uk\/wp-json\/wp\/v2\/categories?post=384605"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.newsbeep.com\/uk\/wp-json\/wp\/v2\/tags?post=384605"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}