{"id":83760,"date":"2025-08-15T03:16:13","date_gmt":"2025-08-15T03:16:13","guid":{"rendered":"https:\/\/www.newsbeep.com\/us\/83760\/"},"modified":"2025-08-15T03:16:13","modified_gmt":"2025-08-15T03:16:13","slug":"reddit-to-block-the-internet-archive-from-indexing-the-site","status":"publish","type":"post","link":"https:\/\/www.newsbeep.com\/us\/83760\/","title":{"rendered":"Reddit to block the Internet Archive from indexing the site"},"content":{"rendered":"<p>There\u2019s an old saying that everything that goes on the internet stays on the internet.<\/p>\n<p>Of course, this is only true to a certain extent. According to a <a href=\"https:\/\/www.pewresearch.org\/data-labs\/2024\/05\/17\/when-online-content-disappears\/\" target=\"_blank\" rel=\"noreferrer noopener nofollow\">2024 Pew study<\/a>, one in four webpages that were online at some point between 2013 and 2023 are no longer accessible. For older pages, the problem is even more pronounced; the Pew study states that 38 percent of webpages that existed in 2013 are no longer available.<\/p>\n<p>This is where services like the <a href=\"https:\/\/archive.org\/\" target=\"_blank\" rel=\"noreferrer noopener nofollow\">Internet Archive<\/a> and its Wayback Machine come in. The Internet Archive describes itself as a \u201cdigital library of Internet sites and other cultural artifacts in digital form,\u201d and its Wayback Machine allows users to look at defunct websites and older versions of current-day sites. 
This is an <a href=\"https:\/\/blog.archive.org\/2023\/07\/10\/preserving-the-past-empowering-the-future-unveiling-the-wayback-machines-vital-role-in-investigative-work\/\" target=\"_blank\" rel=\"noreferrer noopener nofollow\">invaluable tool<\/a> for researchers, as it allows them to see information that is no longer online, as well as how and when sites and articles have been edited.<\/p>\n<p>However, this tool is about to be slightly less effective, as <a href=\"https:\/\/dailydot.com\/tags\/reddit\" rel=\"nofollow noopener\" target=\"_blank\">Reddit<\/a> recently announced that it would be blocking the service from indexing most of the site moving forward. The reason? A.I.<\/p>\n<p>A History of Reddit Limiting Access<\/p>\n<p>As reported by <a href=\"https:\/\/www.theverge.com\/news\/757538\/reddit-internet-archive-wayback-machine-block-limit\" target=\"_blank\" rel=\"noreferrer noopener nofollow\">The Verge<\/a>, Reddit will now block the Internet Archive from indexing many of the pages on the site. While the Wayback Machine will still be able to index the homepage, showing which threads on the site were the most popular at a given date and time, Reddit will no longer allow the service to save individual threads.<\/p>\n<p>The reason for this, the social media site says, is the rise of Artificial Intelligence and Large Language Models.<\/p>\n<p>In short, while Reddit used to allow free and open access to its API, it has gradually begun charging fees for the use of its vast array of content. 
In 2023, the company <a href=\"https:\/\/www.theverge.com\/2023\/4\/18\/23688463\/reddit-developer-api-terms-change-monetization-ai\" target=\"_blank\" rel=\"noreferrer noopener nofollow\">announced<\/a> that it would begin charging companies for developer access to its API, and in 2024, it began to <a href=\"https:\/\/www.404media.co\/google-is-the-only-search-engine-that-works-on-reddit-now-thanks-to-ai-deal\/\" target=\"_blank\" rel=\"noreferrer noopener nofollow\">charge search engines<\/a> to index its content.<\/p>\n<p>  <img decoding=\"async\" width=\"2000\" height=\"1333\" alt=\"In Body Image\" class=\"wp-image-1887178\" src=\"https:\/\/www.newsbeep.com\/us\/wp-content\/uploads\/2025\/08\/AdobeStock_823032387_Editorial_Use_Only.jpeg\"  loading=\"lazy\"\/>Koshiro K\/Adobe Stock<\/p>\n<p>Why the sudden clampdown? Since ChatGPT debuted, there\u2019s been growing interest in Large Language Models across the tech sector. Because Reddit is a massive and constantly updated repository of naturalistic user-generated content in multiple languages, it\u2019s become a prime source of data for training these LLMs.<\/p>\n<p>Why is Reddit Blocking the Internet Archive from Indexing the Site?<\/p>\n<p>Seeing that LLMs were using Reddit\u2019s data, the site began to charge companies for use, <a href=\"https:\/\/www.theverge.com\/2024\/5\/16\/24158529\/reddit-openai-chatgpt-api-access-advertising\" target=\"_blank\" rel=\"noreferrer noopener nofollow\">striking deals with OpenAI<\/a> and <a href=\"https:\/\/www.theverge.com\/2024\/2\/22\/24080165\/google-reddit-ai-training-data\" target=\"_blank\" rel=\"noreferrer noopener nofollow\">Google<\/a> to allow their LLMs to be trained on its data.<\/p>\n<p>The site\u2019s recent clampdown on the <a href=\"https:\/\/www.dailydot.com\/tags\/internet-archive\/\" rel=\"nofollow noopener\" target=\"_blank\">Internet Archive<\/a> is, according to Reddit, related to the use of this data. 
While companies are supposed to pay Reddit to access its broad swath of content, Reddit spokesperson Tim Rathschmidt claims that some companies are circumventing this by downloading the site from saved versions on the Internet Archive.<\/p>\n<p>\u201cInternet Archive provides a service to the open web, but we\u2019ve been made aware of instances where AI companies violate platform policies, including ours, and scrape data from the Wayback Machine,\u201d Rathschmidt told The Verge.<\/p>\n<p><img decoding=\"async\" width=\"2000\" height=\"1333\" alt=\"In Body Image\" class=\"wp-image-1887179\" src=\"https:\/\/www.newsbeep.com\/us\/wp-content\/uploads\/2025\/08\/AdobeStock_823030512_Editorial_Use_Only.jpeg\"  loading=\"lazy\"\/>Koshiro K\/Adobe Stock<\/p>\n<p>However, this doesn\u2019t appear to be the only reason. Rathschmidt added that \u201cuntil [the Internet Archive is] able to defend their site and comply with platform policies (e.g., respecting user privacy, re: deleting removed content) we\u2019re limiting some of their access to Reddit data to protect redditors.\u201d<\/p>\n<p>These limitations will be implemented slowly, with the company saying that they will \u201cinform [the Internet Archive] of the limits before they go into effect.\u201d In response, Mark Graham, director of the Wayback Machine, said in a statement to The Verge that \u201cWe have a longstanding relationship with Reddit and continue to have ongoing discussions about this matter.\u201d<\/p>\n<p>Redditors React<\/p>\n<p>On Reddit, a <a href=\"https:\/\/www.reddit.com\/r\/technology\/comments\/1mniom8\/reddit_will_block_the_internet_archive\/?sort=top\" target=\"_blank\" rel=\"noreferrer noopener nofollow\">thread on the r\/technology subreddit<\/a> about this news quickly racked up over 30 thousand upvotes, with many claiming that stories like these showed how the days of a free and open internet were gradually coming to an end.<\/p>\n<p>\u201cOutrageous, especially with how often posts, 
threads and users get deleted,\u201d <a href=\"https:\/\/www.reddit.com\/r\/technology\/comments\/1mniom8\/comment\/n852wwp\/\" target=\"_blank\" rel=\"noreferrer noopener nofollow\">wrote<\/a> a user.<\/p>\n<p>\u201cNew age of internet censorship,\u201d declared a <a href=\"https:\/\/www.reddit.com\/r\/technology\/comments\/1mniom8\/comment\/n85718n\/\" target=\"_blank\" rel=\"noreferrer noopener nofollow\">second<\/a>, citing issues like the U.K.\u2019s new <a href=\"https:\/\/www.dailydot.com\/news\/uk-age-verification-law\/\" target=\"_blank\" rel=\"noreferrer noopener nofollow\">age verification law<\/a>.\u00a0<\/p>\n<p>Others questioned whether Reddit was being truthful in their statements, claiming that \u201cscraping\u201d the Internet Archive would be a difficult and time-consuming process. Instead, they alleged other factors may be at play.<\/p>\n<p>\u201cIt\u2019s just bull****. The internet archive has pretty aggressive rate limiting, and the loading speed isn\u2019t very fast in the first place,\u201d said a <a href=\"https:\/\/www.reddit.com\/r\/technology\/comments\/1mnqhsj\/comment\/n88xs6u\/\" target=\"_blank\" rel=\"noreferrer noopener nofollow\">commenter<\/a>. \u201cScraping the Wayback machine isn\u2019t exactly efficient. It\u2019s just a false pretense to squeeze them for some money.\u201d<\/p>\n<p>\u201cThis makes zero sense. If anyone has used the Internet Archive, they will quickly realize how difficult it would be to scrape because it is so d***ed slow!\u201d exclaimed <a href=\"https:\/\/www.reddit.com\/r\/technology\/comments\/1mniom8\/comment\/n85r7na\/\" target=\"_blank\" rel=\"noreferrer noopener nofollow\">another<\/a>.<\/p>\n<p>\u201cReddit can\u2019t have people recording all of the admin\/moderator manipulation. It ruins their platform\u2019s credibility. 
And thus its cultural relevance and shareholder value,\u201d suggested a <a href=\"https:\/\/www.reddit.com\/r\/technology\/comments\/1mniom8\/comment\/n8558jf\/\" target=\"_blank\" rel=\"noreferrer noopener nofollow\">third<\/a>.<\/p>\n<p>We\u2019ve reached out to Reddit and the Internet Archive via email.<\/p>\n","protected":false},"excerpt":{"rendered":"There\u2019s an old saying that everything that goes on the internet stays on the internet. Of&hellip;\n","protected":false},"author":2,"featured_media":83761,"comment_status":"","ping_status":"","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[43],"tags":[182,58283,3195,174,54503,250,58284,58285,74],"class_list":{"0":"post-83760","1":"post","2":"type-post","3":"status-publish","4":"format-standard","5":"has-post-thumbnail","7":"category-internet","8":"tag-ai","9":"tag-apple-news-feed","10":"tag-chatgpt","11":"tag-internet","12":"tag-internet-archive","13":"tag-reddit","14":"tag-samsung-news-feed","15":"tag-tech-culture","16":"tag-technology"},"_links":{"self":[{"href":"https:\/\/www.newsbeep.com\/us\/wp-json\/wp\/v2\/posts\/83760","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.newsbeep.com\/us\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.newsbeep.com\/us\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.newsbeep.com\/us\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/www.newsbeep.com\/us\/wp-json\/wp\/v2\/comments?post=83760"}],"version-history":[{"count":0,"href":"https:\/\/www.newsbeep.com\/us\/wp-json\/wp\/v2\/posts\/83760\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.newsbeep.com\
/us\/wp-json\/wp\/v2\/media\/83761"}],"wp:attachment":[{"href":"https:\/\/www.newsbeep.com\/us\/wp-json\/wp\/v2\/media?parent=83760"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.newsbeep.com\/us\/wp-json\/wp\/v2\/categories?post=83760"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.newsbeep.com\/us\/wp-json\/wp\/v2\/tags?post=83760"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}