{"id":459658,"date":"2026-02-05T08:10:14","date_gmt":"2026-02-05T08:10:14","guid":{"rendered":"https:\/\/www.newsbeep.com\/au\/459658\/"},"modified":"2026-02-05T08:10:14","modified_gmt":"2026-02-05T08:10:14","slug":"news-sites-are-locking-out-the-internet-archive-to-stop-ai-crawling-is-the-open-web-closing","status":"publish","type":"post","link":"https:\/\/www.newsbeep.com\/au\/459658\/","title":{"rendered":"News sites are locking out the Internet Archive to stop AI crawling. Is the \u2018open web\u2019 closing?"},"content":{"rendered":"<p>When the World Wide Web went live in the early 1990s, its founders <a href=\"https:\/\/www.theguardian.com\/technology\/2017\/mar\/11\/tim-berners-lee-web-inventor-save-internet\" rel=\"nofollow noopener\" target=\"_blank\">hoped<\/a> it would be a space for anyone to share information and collaborate. But today, the free and open web is shrinking.<\/p>\n<p>The <a href=\"https:\/\/archive.org\/\" rel=\"nofollow noopener\" target=\"_blank\">Internet Archive<\/a> has been recording the history of the internet and making it available to the public through its <a href=\"https:\/\/web.archive.org\/\" rel=\"nofollow noopener\" target=\"_blank\">Wayback Machine<\/a> since <a href=\"https:\/\/theconversation.com\/the-internet-archive-has-been-fighting-for-25-years-to-keep-whats-on-the-web-from-disappearing-and-you-can-help-163867\" rel=\"nofollow noopener\" target=\"_blank\">1996<\/a>. Now, some of the world\u2019s biggest news outlets are <a href=\"https:\/\/www.engadget.com\/ai\/publishers-are-blocking-the-internet-archive-for-fear-ai-scrapers-can-use-it-as-a-workaround-204001754.html\" rel=\"nofollow noopener\" target=\"_blank\">blocking<\/a> the archive\u2019s access to their pages.<\/p>\n<p>Major publishers \u2013 including The Guardian, The New York Times, the Financial Times, and USA Today \u2013 have confirmed they\u2019re ending the Internet Archive\u2019s access to their content. 
<\/p>\n<p>While <a href=\"https:\/\/www.niemanlab.org\/2026\/01\/news-publishers-limit-internet-archive-access-due-to-ai-scraping-concerns\/\" rel=\"nofollow noopener\" target=\"_blank\">publishers say<\/a> they support the archive\u2019s preservation mission, they argue unrestricted access creates unintended consequences, exposing journalism to AI crawlers and members of the public trying to skirt their paywalls.  <\/p>\n<p>Yet, publishers don\u2019t simply want to lock out AI crawlers. Rather, they want to <a href=\"https:\/\/digiday.com\/media\/media-briefing-associated-press-deal-cements-microsofts-quiet-rise-in-ai-licensing\/\" rel=\"nofollow noopener\" target=\"_blank\">sell their content<\/a> to data-hungry tech companies. Their back catalogues of news, books and other media have become a <a href=\"https:\/\/digiday.com\/media\/ap-makes-its-archive-ai-ready-to-tap-the-enterprise-rag-boom\/\" rel=\"nofollow noopener\" target=\"_blank\">hot commodity<\/a> as data to train AI systems.<\/p>\n<p>Robot readers<\/p>\n<p>Generative AI systems such as ChatGPT, Copilot and Gemini require access to large archives of content (such as media content, books, art and academic research) for <a href=\"https:\/\/dl.acm.org\/doi\/10.1145\/3442188.3445922\" rel=\"nofollow noopener\" target=\"_blank\">training<\/a> and to <a href=\"https:\/\/doi.org\/10.1007\/s12599-025-00945-3\" rel=\"nofollow noopener\" target=\"_blank\">answer user prompts<\/a>. <\/p>\n<p>Publishers claim technology companies have accessed a lot of this content for free and <a href=\"https:\/\/theconversation.com\/meta-allegedly-used-pirated-books-to-train-ai-australian-authors-have-objected-but-us-courts-may-decide-if-this-is-fair-use-253105\" rel=\"nofollow noopener\" target=\"_blank\">without the consent of copyright owners<\/a>. Some began taking tech companies to court, claiming they had stolen their intellectual property. 
High-profile examples include <a href=\"https:\/\/www.nytimes.com\/2023\/12\/27\/business\/media\/new-york-times-open-ai-microsoft-lawsuit.html\" rel=\"nofollow noopener\" target=\"_blank\">The New York Times<\/a>\u2019 case against ChatGPT\u2019s parent company OpenAI and <a href=\"https:\/\/www.theguardian.com\/technology\/2024\/oct\/21\/rupert-murdoch-ai-lawsuit-new-york-post-dow-jones\" rel=\"nofollow noopener\" target=\"_blank\">News Corp\u2019s lawsuit against Perplexity AI<\/a>. <\/p>\n<p>            <a href=\"https:\/\/images.theconversation.com\/files\/716449\/original\/file-20260205-64-vz66wk.jpg?ixlib=rb-4.1.0&amp;q=45&amp;auto=format&amp;w=1000&amp;fit=clip\" rel=\"nofollow noopener\" target=\"_blank\"><img decoding=\"async\" alt=\"The outside of the New York Times building in New York\" src=\"https:\/\/www.newsbeep.com\/au\/wp-content\/uploads\/2026\/02\/file-20260205-64-vz66wk.jpg\" class=\"native-lazy\" loading=\"lazy\"  \/><\/a><\/p>\n<p>              The New York Times has sued OpenAI for alleged copyright infringement.<br \/>\n              <a class=\"source\" href=\"https:\/\/photos.aap.com.au\/search\/20231228001882529073\" rel=\"nofollow noopener\" target=\"_blank\">Sarah Yenesel\/EPA<\/a><\/p>\n<p>Old news, new money<\/p>\n<p>In response, some tech companies have <a href=\"https:\/\/digiday.com\/media\/news-corp-in-talks-with-google-for-ai-licensing-deal\/\" rel=\"nofollow noopener\" target=\"_blank\">struck<\/a> <a href=\"https:\/\/www.theguardian.com\/gnm-press-office\/2025\/feb\/14\/guardian-media-group-announces-strategic-partnership-with-openai\" rel=\"nofollow noopener\" target=\"_blank\">deals<\/a> to pay for access to publishers\u2019 content. 
News Corp\u2019s contract with OpenAI is reportedly <a href=\"https:\/\/www.wsj.com\/business\/media\/openai-news-corp-strike-deal-23f186ba\" rel=\"nofollow noopener\" target=\"_blank\">worth more than US$250 million<\/a> over five years.<\/p>\n<p>Similar deals have been struck between academic publishers and tech companies. Publishing houses such as <a href=\"https:\/\/theconversation.com\/academic-publishing-is-a-multibillion-dollar-industry-its-not-always-good-for-science-250056\" rel=\"nofollow noopener\" target=\"_blank\">Taylor &amp; Francis and Elsevier<\/a> have come under scrutiny in the past for locking publicly funded research behind commercial paywalls. <\/p>\n<p>Now, <a href=\"https:\/\/www.insidehighered.com\/news\/faculty-issues\/research\/2024\/07\/29\/taylor-francis-ai-deal-sets-worrying-precedent\" rel=\"nofollow noopener\" target=\"_blank\">Taylor &amp; Francis<\/a> has signed a US$10 million nonexclusive deal with Microsoft granting the company access to over 3,000 journals. <\/p>\n<p>Publishers are also using <a href=\"https:\/\/www.editorandpublisher.com\/stories\/revenue-roundup-can-ai-marketplaces-save-publishings-future,258345\" rel=\"nofollow noopener\" target=\"_blank\">technology to stop unwanted AI bots<\/a> accessing their content, including the crawlers used by the Internet Archive to record internet history. 
News publishers have referred to the Internet Archive as a \u201c<a href=\"https:\/\/www.niemanlab.org\/2026\/01\/news-publishers-limit-internet-archive-access-due-to-ai-scraping-concerns\/\" rel=\"nofollow noopener\" target=\"_blank\">back door<\/a>\u201d to their catalogues, allowing unscrupulous tech companies to continue scraping their content.<\/p>\n<p>            <a href=\"https:\/\/images.theconversation.com\/files\/716451\/original\/file-20260205-56-f2be09.jpg?ixlib=rb-4.1.0&amp;q=45&amp;auto=format&amp;w=1000&amp;fit=clip\" rel=\"nofollow noopener\" target=\"_blank\"><img decoding=\"async\" alt=\"A person browses the Internet Archive on a laptop\" src=\"https:\/\/www.newsbeep.com\/au\/wp-content\/uploads\/2026\/02\/file-20260205-56-f2be09.jpg\" class=\"native-lazy\" loading=\"lazy\"  \/><\/a><\/p>\n<p>              The Internet Archive has been systematically archiving the web for about three decades.<br \/>\n              <a class=\"source\" href=\"https:\/\/www.gettyimages.com.au\/detail\/news-photo\/in-this-photo-illustration-a-woman-browses-the-internet-news-photo\/2205904515\" rel=\"nofollow noopener\" target=\"_blank\">Serene Lee\/SOPA Images\/LightRocket via Getty Images<\/a><\/p>\n<p>The cost of making news free<\/p>\n<p>The Wayback Machine has also been used by members of the public to <a href=\"https:\/\/www.niemanlab.org\/2026\/01\/news-publishers-limit-internet-archive-access-due-to-ai-scraping-concerns\/\" rel=\"nofollow noopener\" target=\"_blank\">avoid newspaper paywalls<\/a>. Understandably, media outlets want readers to pay for news. <\/p>\n<p>News is a business, and its <a href=\"https:\/\/www.cjr.org\/tow_center_reports\/platform-press-how-silicon-valley-reengineered-journalism.php\" rel=\"nofollow noopener\" target=\"_blank\">advertising revenue model<\/a> has come under increasing pressure from the same tech companies using news content for AI training and retrieval. 
But paywalls come at the expense of public access to credible information. <\/p>\n<p>When newspapers first started moving their content online and making it free to the public in the late 1990s, they contributed to the ethos of sharing and collaboration on the early web.<\/p>\n<p>In hindsight, however, one commentator called free access the \u201c<a href=\"https:\/\/www.theatlantic.com\/technology\/archive\/2014\/08\/advertising-is-the-internets-original-sin\/376041\/\" rel=\"nofollow noopener\" target=\"_blank\">original sin<\/a>\u201d of online news. The public became accustomed to getting their digital editions for free, and as online business models shifted, many small and mid-sized news companies struggled to fund their operations.<\/p>\n<p>The opposite approach \u2013 placing all commercial news behind paywalls \u2013 has its own problems. As news publishers move to <a href=\"https:\/\/doi.org\/10.26493\/1854-6935.18.323-348\" rel=\"nofollow noopener\" target=\"_blank\">subscription-only models<\/a>, people have to juggle multiple expensive subscriptions or limit their <a href=\"https:\/\/reutersinstitute.politics.ox.ac.uk\/sites\/default\/files\/2025-06\/Digital_News-Report_2025.pdf\" rel=\"nofollow noopener\" target=\"_blank\">news appetite<\/a>. Otherwise, they\u2019re left with whatever news remains online for free or is served up by social media <a href=\"https:\/\/link.springer.com\/book\/10.1007\/978-3-030-87086-7\" rel=\"nofollow noopener\" target=\"_blank\">algorithms<\/a>. 
The result is a more closed, commercial internet.<\/p>\n<p>This isn\u2019t the first time that the Internet Archive has been in the <a href=\"https:\/\/theconversation.com\/internet-archives-digital-library-has-been-found-in-breach-of-copyright-the-decision-has-some-important-implications-212091\" rel=\"nofollow noopener\" target=\"_blank\">crosshairs of publishers<\/a>, as the organisation was previously sued and found to be in breach of copyright through its Open Library project.<\/p>\n<p>The past and future of the internet<\/p>\n<p>The Wayback Machine has served as a public record of the web for <a href=\"https:\/\/theconversation.com\/the-internet-archive-has-been-fighting-for-25-years-to-keep-whats-on-the-web-from-disappearing-and-you-can-help-163867\" rel=\"nofollow noopener\" target=\"_blank\">about three decades<\/a>, used by researchers, educators, journalists and amateur internet historians. <\/p>\n<p>Blocking its access to international newspapers of note will leave significant holes in the public record of the internet.  <\/p>\n<p>Today, you can use the <a href=\"https:\/\/web.archive.org\/web\/19970601173955\/http:\/\/www.thenewyorktimes.com\/\" rel=\"nofollow noopener\" target=\"_blank\">Wayback Machine<\/a> to see The New York Times\u2019 front page from June 1997: the first time the Internet Archive crawled the newspaper\u2019s website. In another 30 years, internet researchers and curious members of the public won\u2019t have access to today\u2019s front page, even if the Internet Archive is still around.<\/p>\n<p>Today\u2019s websites become tomorrow\u2019s historical records. 
Without the preservation efforts of not-for-profit organisations like the Internet Archive, <a href=\"https:\/\/theconversation.com\/academic-publishing-is-a-multibillion-dollar-industry-its-not-always-good-for-science-250056\" rel=\"nofollow noopener\" target=\"_blank\">we risk losing vital records<\/a>.<\/p>\n<p>Despite the actions of commercial publishers and <a href=\"https:\/\/theconversation.com\/wikipedia-at-25-can-its-original-ideals-survive-in-the-age-of-ai-273473\" rel=\"nofollow noopener\" target=\"_blank\">emerging challenges of AI<\/a>, not-for-profit organisations such as the Internet Archive and <a href=\"https:\/\/theconversation.com\/wikipedia-at-25-can-its-original-ideals-survive-in-the-age-of-ai-273473\" rel=\"nofollow noopener\" target=\"_blank\">Wikipedia<\/a> aim to keep the dream of an open, collaborative and transparent internet alive.<\/p>\n","protected":false},"excerpt":{"rendered":"When the World Wide Web went live in the early 1990s, its founders hoped it would be 
a&hellip;\n","protected":false},"author":2,"featured_media":459659,"comment_status":"","ping_status":"","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[20],"tags":[256,254,255,64,63,105],"class_list":{"0":"post-459658","1":"post","2":"type-post","3":"status-publish","4":"format-standard","5":"has-post-thumbnail","7":"category-artificial-intelligence","8":"tag-ai","9":"tag-artificial-intelligence","10":"tag-artificialintelligence","11":"tag-au","12":"tag-australia","13":"tag-technology"},"_links":{"self":[{"href":"https:\/\/www.newsbeep.com\/au\/wp-json\/wp\/v2\/posts\/459658","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.newsbeep.com\/au\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.newsbeep.com\/au\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.newsbeep.com\/au\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/www.newsbeep.com\/au\/wp-json\/wp\/v2\/comments?post=459658"}],"version-history":[{"count":0,"href":"https:\/\/www.newsbeep.com\/au\/wp-json\/wp\/v2\/posts\/459658\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.newsbeep.com\/au\/wp-json\/wp\/v2\/media\/459659"}],"wp:attachment":[{"href":"https:\/\/www.newsbeep.com\/au\/wp-json\/wp\/v2\/media?parent=459658"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.newsbeep.com\/au\/wp-json\/wp\/v2\/categories?post=459658"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.newsbeep.com\/au\/wp-json\/wp\/v2\/tags?post=459658"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}