
Picture: Shutterstock/Ketmut
Almost eight in ten of the biggest news websites in the UK and US are now blocking AI training bots.
Publishers were told in London this week that it is “never too late” to start blocking the bots, because retrieval-augmented generation means AI companies must keep recrawling the web for their chatbots’ responses to stay current.
Some 79% of almost 100 top news websites in the UK and US are blocking at least one of the crawlers used for AI training: OpenAI’s GPTBot, Anthropic’s ClaudeBot and anthropic-ai, CCBot, Applebot-Extended and Google-Extended.
Meanwhile 71% are blocking the AI bots that crawl their sites for retrieval or live search: ChatGPT-User, Claude-Web, Perplexity-User and OAI-SearchBot.
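For illustration, a robots.txt along the following lines would block all six training crawlers while still admitting the retrieval bots. This is a sketch only: real publisher files vary, and under the robots.txt standard an empty Disallow line means a bot may fetch everything.

```
# Block crawlers that gather content for AI training
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: anthropic-ai
User-agent: CCBot
User-agent: Applebot-Extended
User-agent: Google-Extended
Disallow: /

# Permit retrieval/live-search bots that fetch pages in real time
User-agent: ChatGPT-User
User-agent: Claude-Web
User-agent: Perplexity-User
User-agent: OAI-SearchBot
Disallow:

# All other crawlers keep full access
User-agent: *
Disallow:
```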
The latest analysis of who is blocking what was carried out by digital PR platform Buzzstream, which looked at a combined, deduplicated list of the 50 biggest news websites in each of the UK and US and shared the results with Press Gazette.
Among the 50 biggest, the publishers that had blocked all the AI bots included in the analysis were: the BBC (both its .co.uk and .com domains), The New York Times, the Daily Mail, The Telegraph, Sky News, AP News, New York Post, Newsweek, NBC News, Wall Street Journal, Metro, Business Insider, ABC News, Buzzfeed, Huffpost and The Hill. That means 34% of the top 50 blocked all the bots.
Some 14% of the top-50 publishers allowed access to all 11 AI crawlers analysed: Fox News, The Independent, GB News, Substack, the Standard, the Drudge Report and Politico.
Telegraph SEO director Harry Clarkson-Bennett told Buzzstream: “Publishers are blocking AI bots using the robots.txt because there’s almost no value exchange.
“LLMs are not designed to send referral traffic and publishers (still) need traffic to survive.
“So most of us block AI bots because these companies are not willing to pay for the content their model has been trained on and their output is almost entirely internal.”
Across the 50 biggest sites analysed, Perplexity-User was the most widely permitted bot (allowed by 32 sites), followed by ChatGPT-User (24). Both are retrieval/live-search crawlers: permitting them determines whether Perplexity or ChatGPT can access a site in real time to answer user questions. By contrast, PerplexityBot surfaces and links to websites within Perplexity’s search results.
The least permitted was Anthropic’s training bot anthropic-ai (allowed by eight sites), followed by the Common Crawl web archive bot CCBot (nine).
In February 2024, Press Gazette analysis showed that 61 of the 106 biggest news websites in the UK and US (around 58%) were blocking at least one AI crawler, meaning blocking has increased markedly over the past two years.
Google-Extended “allows you to say yes to search, no to AI training”, Google’s senior manager for government affairs and public policy Roxanne Carter told the UK Parliament last week.
Google-Extended lets publishers opt out of having their content used by AI chatbot Gemini and AI development platform Vertex AI, but it does not stop sites from being accessed and used in Google’s AI Overviews.
To avoid that, publishers would have to block Googlebot, which indexes content for Google Search. Googlebot was not included in the new Buzzstream analysis, but opting out of it has been described as an “unenviable choice”.
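In robots.txt terms, the choice looks something like the sketch below (illustrative only): blocking the Google-Extended token withholds content from Gemini and Vertex AI, while leaving Googlebot, and with it search indexing, untouched.

```
# Opt out of content use for Gemini and Vertex AI
User-agent: Google-Extended
Disallow: /

# Googlebot keeps crawling for search, so the site can still
# be surfaced in Google Search and in AI Overviews
User-agent: Googlebot
Disallow:
```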
Buzzstream found that US publishers are “much more aggressive” in blocking Google-Extended (58%) than UK publishers (29%).
Among the 100 top news sites in the US and UK, American sites are stricter on training bots (81% blocking) than their UK counterparts (77%).
Robots.txt is not a foolproof way of blocking AI bots: its directives can be ignored by AI companies or circumvented, for example by using third-party companies to scrape content instead.
Typos and misconfigurations may stop publishers blocking effectively
Anthony Katsur, CEO of IAB Tech Lab, told the Interactive Advertising Bureau (IAB)’s Techtonic event in London on Wednesday that the Tech Lab had “scoured a large percentage” of publisher commands to AI bots and found “there’s a lot of typos or a lot of misconfigurations of robots.txt, which means that may not be obeyed – not that it’s always obeyed anyway”.
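The effect of such typos can be checked programmatically. The Python sketch below (hypothetical, not the IAB Tech Lab’s tooling) uses the standard library’s robots.txt parser to report which of the analysed user agents a site actually blocks; a misspelled token such as “GTPBot” simply never matches, leaving that bot allowed.

```python
# A minimal sketch (not IAB Tech Lab's tooling) that checks which of the
# AI user agents from the Buzzstream analysis a site's robots.txt blocks.
# A typo in the token name ("GTPBot" instead of "GPTBot", say) would leave
# that bot fully permitted - the kind of misconfiguration Katsur described.
from urllib.robotparser import RobotFileParser

AI_BOTS = [
    "GPTBot", "ClaudeBot", "anthropic-ai", "CCBot",
    "Applebot-Extended", "Google-Extended",
    "ChatGPT-User", "Claude-Web", "Perplexity-User", "OAI-SearchBot",
]

def check_blocking(site: str) -> dict[str, bool]:
    """Return {user_agent: blocked_from_homepage} for one site."""
    parser = RobotFileParser(f"https://{site}/robots.txt")
    parser.read()  # fetches and parses the live robots.txt
    return {bot: not parser.can_fetch(bot, f"https://{site}/") for bot in AI_BOTS}

if __name__ == "__main__":
    # "example.com" is a placeholder; substitute any news site's domain
    for bot, blocked in check_blocking("example.com").items():
        print(f"{bot:20} {'blocked' if blocked else 'allowed'}")
```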
Katsur recommended that publishers block AI bots even if they have not done so already, because retrieval-augmented generation means generative AI models retrieve and reference new information from the web in real time rather than relying on a single initial discovery crawl.
Katsur said: “Publishers, if you have not blocked and you think that it is too late to block, you are wrong. It is never too late to block.
“The LLMs will come back and they will recrawl content in order for that information to stay fresh and relevant and accurate.”
He also said: “In the history of mankind, no marketplace has ever been created when things are given away for free or just stolen. So without scarcity, marketplaces won’t exist. So publishers, if you are not blocking, then it is unlikely that some sort of content marketplace will take shape.”
Anthony Katsur, CEO, IAB Tech Lab speaking at IAB UK Techtonic in London on Wednesday 21 January 2026. Picture: Ben Stevens Photography/IAB UK
Katsur suggested that publishers should collectively “link arms” and block the bots “full stop”, even if only for 72 hours, “just to show that there’s strength amongst publishers, big and small”.
He said that if publishers were threatened with an antitrust or collusion lawsuit as a result, “that would be a good problem to have because it would mean you’re still in business”.
Katsur said the IAB Tech Lab believes the “days of crawling” are not “sustainable for publishers”.
“We also don’t necessarily think it’s sustainable for LLMs… as we see the AI bubble burst, I think from a fiscal or a funding perspective, easy access to ongoing purchase of GPUs [graphics processing units] to fund the LLMs will become constrained, therefore you will need greater efficiencies in terms of how content is accessed and structured for more efficient LLMs.”
The IAB Tech Lab is working on an alternative Content Monetisation Protocols (CoMP) initiative that, Katsur said, would involve a “new set of open source APIs that would enable LLMs to access publisher or brand content only under strict controls and structures, thereby protecting your intellectual property, your copyright, and again, also providing LLMs with a well structured format for more efficient consumption, use and accuracy”.
A CoMP working group held in October saw Google, Microsoft and Meta take a seat at the table alongside publishers. But Katsur said the Tech Lab has “had absolute crickets from OpenAI, Anthropic, Perplexity and some of the other smaller ones”.