You can divide the recent history of LLM data scraping into a few phases. For years, there was an experimental period in which ethical and legal considerations about where and how to acquire training data for hungry models were treated as afterthoughts. Once apps like ChatGPT became popular and companies started commercializing their models, the matter of training data became instantly and extremely contentious.
Authors, filmmakers, musicians, and major publishers and internet companies started calling out AI firms and filing lawsuits. OpenAI started making individual deals with publishers and platforms — including Reddit and New York’s parent company, Vox Media — to ensure ongoing access to data for training and up-to-date chat content, while other companies, including Google and Amazon, entered into licensing deals of their own. Despite these deals and legal battles, however, AI scraping became only more widespread and brazen, leaving the rest of the web to wonder what, exactly, is supposed to happen next.
They’re up against sophisticated actors. Lavishly funded start-ups and tech megafirms are looking for high-quality data wherever they can find it, offline and on, and web scraping has turned into an arms race. There are scrapers masquerading as search engines or regular users, and blocked companies are building undercover crawlers. Website operators, accustomed to having at least nominal control over whether search engines index their content, are instead seeing swarms of voracious machines making constant attempts to harvest their content, spamming them with billions of requests. Internet infrastructure providers report the same thing: AI crawlers are going for broke. A leaked list of sites allegedly scraped by Meta, obtained by Drop Site News, includes “copyrighted content, pirated content, and adult videos, some of whose content is potentially illegally obtained or recorded, as well as news and original content from prominent outlets and content publishers.” This is neither surprising nor unique to one company. It’s closer to industry-standard practice.
For decades, the most obvious reason to crawl the web was to build a useful index or, later, a search engine like Google. A Google crawl meant you had a chance to show up in search results and actual people might visit your website. AI crawlers offer a different proposition. They come, they crawl, and they copy. Then they use that copied data to build products that in many cases compete with their sources (see: Wikipedia or any news site) and at most offer in return footnoted links few people will follow (see: ChatGPT Search and Google’s AI Mode). For an online-publishing ecosystem already teetering on the edge of collapse, such an arrangement looks profoundly grim. AI firms scraped the web to build models that will continue to scrape the web until there’s nothing left.
In June, Cloudflare, an internet infrastructure firm that handles a significant portion of online traffic, announced a set of tools for tracking AI scraping and plans to build a “marketplace” that would allow sites to set prices for “accessing and taking their content to ingest into these systems.” This week, a group of online organizations and websites — including Reddit, Medium, Quora, and Cloudflare competitor Fastly — announced the RSL standard, short for Really Simple Licensing (a reference to RSS, or Really Simple Syndication, some co-creators of which are involved in the effort). The idea is simple: With search engines, publishers could indicate whether they wanted to be indexed, and major search engines usually obliged; now, under more antagonistic circumstances, anyone who hosts content will be able to indicate not just whether the content can be scraped but how it should be attributed and, crucially, how much they want to charge for its use, either individually or as part of a coordinated group.
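The mechanics here are worth spelling out. The search-era convention RSL is building past is robots.txt, a plain-text file in which a site lists which crawlers may fetch which paths, with compliance left entirely to the crawler’s good manners. Below is a minimal sketch of how a polite crawler consults that file, using Python’s standard library; the crawler name is hypothetical, and this illustrates only the old opt-out convention, not the RSL spec itself:

```python
# Sketch of the robots.txt convention: a yes-or-no signal per crawler and path.
# "ExampleAIBot" is a hypothetical crawler name used only for illustration.
from urllib import robotparser

rules = robotparser.RobotFileParser()
rules.set_url("https://example.com/robots.txt")
rules.read()  # fetch and parse the site's stated crawling rules

# Note what's missing: there is no field for attribution terms or pricing,
# which is the gap a licensing standard like RSL is meant to fill.
if rules.can_fetch("ExampleAIBot", "https://example.com/some-article"):
    print("Crawling permitted under the site's stated rules")
else:
    print("Site has opted out for this crawler")
```

Even that thin yes-or-no signal was only ever honored voluntarily, which is part of why the new licensing terms are being paired with infrastructure firms positioned to enforce them.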
When it comes to getting major AI firms to pay up, not to mention the hundreds of smaller firms that are also scraping, RSL is clearly an aspirational effort, and I doubt the first step here is for Meta or OpenAI to instantly cave and start paying royalties to WebMD. Combined with the ability to use services like Cloudflare and Fastly to more effectively block AI firms, though, it does mark the beginning of a potentially major change. For most websites, AI crawling has so far been a net negative, and there isn’t much to lose by shutting it down (with the exception of Google, which crawls for its Search and AI products using the same tools). Now, with the backing of internet infrastructure firms that can actually keep pace with big tech’s scraping tactics, they can. (Tech giants haven’t been above scraping one another’s content, but they’re far better equipped to stop it if they want to.)
A world in which a majority of public websites become invisible to AI firms by default is a world in which firms that have depended on relatively unfettered access to the web could start hurting for up-to-date information, be it breaking news, fresh research, new products, or just ambient culture and memes. They may not be inclined to pay everyone, but they may eventually be forced to pay someone, through RSL or otherwise.