{"id":21012,"date":"2025-09-14T15:54:13","date_gmt":"2025-09-14T15:54:13","guid":{"rendered":"https:\/\/www.newsbeep.com\/ie\/21012\/"},"modified":"2025-09-14T15:54:13","modified_gmt":"2025-09-14T15:54:13","slug":"ai-scraping-free-for-all-by-openai-google-and-meta-is-over","status":"publish","type":"post","link":"https:\/\/www.newsbeep.com\/ie\/21012\/","title":{"rendered":"AI-Scraping Free-for-All by OpenAI, Google, and Meta Is Over"},"content":{"rendered":"<p>                  <img decoding=\"async\" src=\"https:\/\/www.newsbeep.com\/ie\/wp-content\/uploads\/2025\/09\/7a7414080e993fa22f82a9f9f0c56f82cb-altman-AIpays.rhorizontal.w1100.jpg\" class=\"lede-image\" data-content-img=\"\" width=\"1100\" height=\"733\" style=\"width:100%;height:auto;\" fetchpriority=\"high\"\/> <\/p>\n<p>\n                  Photo-Illustration: Intelligencer; Photo: Getty Images\n              <\/p>\n<p class=\"clay-paragraph\" data-editable=\"text\" data-uri=\"nymag.com\/intelligencer\/_components\/clay-paragraph\/instances\/cmfh8sirn000j0igvr0hxofzb@published\" data-word-count=\"63\">You can divide the recent history of LLM data scraping into a few phases. There was for years an experimental period, when ethical and legal considerations about where and how to acquire training data for hungry experimental models were treated as afterthoughts. Once apps like ChatGPT became popular and companies started commercializing models, the matter of training data became instantly and extremely contentious.<\/p>\n<p class=\"clay-paragraph\" data-editable=\"text\" data-uri=\"nymag.com\/intelligencer\/_components\/clay-paragraph\/instances\/cmfh9k3gc000c3b78w6wgrg7s@published\" data-word-count=\"93\">Authors, filmmakers, musicians, and major publishers and internet companies started calling out AI firms and <a href=\"https:\/\/www.nytimes.com\/2023\/12\/27\/business\/media\/new-york-times-open-ai-microsoft-lawsuit.html\" rel=\"nofollow noopener\" target=\"_blank\">filing lawsuits<\/a>. 
OpenAI started making individual <a href=\"https:\/\/nymag.com\/intelligencer\/2023\/12\/what-do-ai-companies-want-with-the-media.html\" rel=\"nofollow noopener\" target=\"_blank\">deals<\/a> with publishers and platforms \u2014 <a href=\"https:\/\/nymag.com\/intelligencer\/article\/why-you-are-reading-reddit-a-lot-more-these-days.html\" rel=\"nofollow noopener\" target=\"_blank\">including Reddit<\/a> and New York\u2019s owner company, Vox Media \u2014 to ensure ongoing access to data for training and up-to-date chat content, while other companies, including Google and Amazon, entered into <a href=\"https:\/\/nymag.com\/intelligencer\/article\/why-you-are-reading-reddit-a-lot-more-these-days.html\" rel=\"nofollow noopener\" target=\"_blank\">licensing deals<\/a> of their own. Despite these deals and legal battles, however, AI scraping became only more widespread and brazen, leaving the rest of the web to wonder what, exactly, is supposed to happen next.<\/p>\n<p class=\"clay-paragraph\" data-editable=\"text\" data-uri=\"nymag.com\/intelligencer\/_components\/clay-paragraph\/instances\/cmfh8ur9k000t3b785troe4nb@published\" data-word-count=\"164\">They\u2019re up against sophisticated actors. Lavishly funded start-ups and tech megafirms are looking for high-quality data wherever they can find it, offline and on, and web scraping has turned into an arms race. There are scrapers <a href=\"https:\/\/www.akamai.com\/blog\/security\/rise-llm-ai-scrapers-bot-management\" rel=\"nofollow noopener\" target=\"_blank\">masquerading<\/a> as search engines or regular users, and blocked companies are building undercover crawlers. 
Website operators, accustomed to having at least nominal control over whether search engines index their content, are <a href=\"https:\/\/www.nna.org\/death-by-a-million-scrapes-what-publishers-face-with-the-rapid-increase-in-ai-scraping\" rel=\"nofollow noopener\" target=\"_blank\">seeing<\/a> the same thing in their data: swarms of voracious machines making constant attempts to harvest their content, spamming them with billions of requests. Internet infrastructure providers are saying the same thing: AI crawlers are going for broke. A leaked list of sites allegedly scraped by Meta, <a href=\"https:\/\/www.dropsitenews.com\/p\/meta-facebook-tech-copyright-privacy-whistleblower\" rel=\"nofollow noopener\" target=\"_blank\">obtained<\/a> by Drop Site News, includes \u201ccopyrighted content, pirated content, and adult videos, some of whose content is potentially illegally obtained or recorded, as well as news and original content from prominent outlets and content publishers.\u201d This is neither surprising nor unique to one company. It\u2019s closer to industry-standard practice.<\/p>\n<p class=\"clay-paragraph\" data-editable=\"text\" data-uri=\"nymag.com\/intelligencer\/_components\/clay-paragraph\/instances\/cmfh8ur9p000u3b78fqc2ai47@published\" data-word-count=\"135\">For decades, the most obvious reason to crawl the web was to build a useful index or, later, a search engine like Google. A Google crawl meant you had a chance to show up in search results and actual people might visit your website. AI crawlers offer a different proposition. They come, they crawl, and they copy. Then they use that copied data to build products that in many cases compete with their sources (see: Wikipedia or any news site) and at most offer in return footnoted links few people will follow (see: ChatGPT Search and Google\u2019s AI Mode). For an online-publishing ecosystem already teetering on the edge of collapse, such an arrangement looks profoundly grim. 
AI firms scraped the web to build models that will <a href=\"https:\/\/nymag.com\/intelligencer\/article\/ai-ate-the-web-now-its-coming-back-for-seconds.html\" rel=\"nofollow noopener\" target=\"_blank\">continue to scrape the web<\/a> until there\u2019s nothing left.<\/p>\n<p class=\"clay-paragraph\" data-editable=\"text\" data-uri=\"nymag.com\/intelligencer\/_components\/clay-paragraph\/instances\/cmfh8ur9p000v3b78e1emudnx@published\" data-word-count=\"161\">In June, Cloudflare, an internet infrastructure firm that handles a significant portion of online traffic, announced a set of tools for tracking AI scraping and plans to build a \u201c<a href=\"https:\/\/techcrunch.com\/2024\/09\/23\/cloudflares-new-marketplace-will-let-websites-charge-ai-bots-for-scraping\/\" rel=\"nofollow noopener\" target=\"_blank\">marketplace<\/a>\u201d that would allow sites to set prices for \u201caccessing and taking their content to ingest into these systems.\u201d This week, a group of online organizations and websites \u2014\u00a0including Reddit, Medium, Quora, and Cloudflare competitor Fastly \u2014\u00a0announced the <a href=\"https:\/\/rslstandard.org\/\" rel=\"nofollow noopener\" target=\"_blank\">RSL standard<\/a>, short for Really Simple Licensing (a reference to RSS, or Really Simple Syndication, some co-creators of which are involved in the effort). 
The idea is simple: With search engines, publishers could indicate whether they wanted to be indexed, and major search engines usually obliged; now, under more antagonistic circumstances, anyone who hosts content will be able to indicate not just whether the content can be scraped but how it should be attributed and, crucially, how much they want to charge for its use, either individually or as part of a coordinated group.<\/p>\n<p class=\"clay-paragraph\" data-editable=\"text\" data-uri=\"nymag.com\/intelligencer\/_components\/clay-paragraph\/instances\/cmfh8uryj000w3b787x4ggkfe@published\" data-word-count=\"157\">As far as getting major AI firms to pay up, not to mention the hundreds of smaller firms that are also scraping, RSL is clearly an aspirational effort, and I doubt the first step here is for Meta or OpenAI to instantly cave and start paying royalties to WebMD. Combined with the ability to use services like Cloudflare and Fastly to more effectively block AI firms, though, it does mark the beginning of a potentially major change. For most websites, AI crawling has so far been a net negative, and there isn\u2019t much to lose by shutting it down (with the <a href=\"https:\/\/blog.cloudflare.com\/from-googlebot-to-gptbot-whos-crawling-your-site-in-2025\/\" rel=\"nofollow noopener\" target=\"_blank\">exception<\/a> of Google, which crawls for its Search and AI products using the same tools). Now, with the backing of internet infrastructure firms that can actually keep pace with big tech\u2019s scraping tactics, they can. 
(Tech giants haven\u2019t been above <a href=\"https:\/\/qz.com\/openai-chatgpt-youtube-videos-gpt4-google-ai-1851394868\" rel=\"nofollow noopener\" target=\"_blank\">scraping<\/a> one another\u2019s content, but they\u2019re far better equipped to stop it if they want to.)<\/p>\n<p class=\"clay-paragraph\" data-editable=\"text\" data-uri=\"nymag.com\/intelligencer\/_components\/clay-paragraph\/instances\/cmfh8uryk000x3b78i03mtap6@published\" data-word-count=\"73\">A world in which a majority of public websites become invisible to AI firms by default is a world in which firms that have depended on relatively unfettered access to the web could start hurting for up-to-date information, be it breaking news, fresh research, new products, or just ambient culture and memes. They may not be inclined to pay everyone, but they may eventually be forced to pay someone, through RSL or otherwise.<\/p>\n","protected":false},"excerpt":{"rendered":"Photo-Illustration: Intelligencer; Photo: Getty Images You can divide the recent history of LLM data scraping into a 
few&hellip;\n","protected":false},"author":2,"featured_media":21013,"comment_status":"","ping_status":"","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[20],"tags":[220,218,219,61,60,18590,3455,80],"class_list":{"0":"post-21012","1":"post","2":"type-post","3":"status-publish","4":"format-standard","5":"has-post-thumbnail","7":"category-artificial-intelligence","8":"tag-ai","9":"tag-artificial-intelligence","10":"tag-artificialintelligence","11":"tag-ie","12":"tag-ireland","13":"tag-john-herrman","14":"tag-screen-time","15":"tag-technology"},"_links":{"self":[{"href":"https:\/\/www.newsbeep.com\/ie\/wp-json\/wp\/v2\/posts\/21012","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.newsbeep.com\/ie\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.newsbeep.com\/ie\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.newsbeep.com\/ie\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/www.newsbeep.com\/ie\/wp-json\/wp\/v2\/comments?post=21012"}],"version-history":[{"count":0,"href":"https:\/\/www.newsbeep.com\/ie\/wp-json\/wp\/v2\/posts\/21012\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.newsbeep.com\/ie\/wp-json\/wp\/v2\/media\/21013"}],"wp:attachment":[{"href":"https:\/\/www.newsbeep.com\/ie\/wp-json\/wp\/v2\/media?parent=21012"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.newsbeep.com\/ie\/wp-json\/wp\/v2\/categories?post=21012"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.newsbeep.com\/ie\/wp-json\/wp\/v2\/tags?post=21012"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}