{"id":250068,"date":"2025-10-25T04:49:15","date_gmt":"2025-10-25T04:49:15","guid":{"rendered":"https:\/\/www.newsbeep.com\/us\/250068\/"},"modified":"2025-10-25T04:49:15","modified_gmt":"2025-10-25T04:49:15","slug":"perplexity-just-got-caught-breaking-the-rules-red-handed","status":"publish","type":"post","link":"https:\/\/www.newsbeep.com\/us\/250068\/","title":{"rendered":"Perplexity Just Got Caught Breaking the Rules Red-Handed"},"content":{"rendered":"<p>\t<img decoding=\"async\" class=\"archive-post-thumb article-featured-image w-full h-auto mb-3\" src=\"https:\/\/www.newsbeep.com\/us\/wp-content\/uploads\/2025\/10\/perplexity-caught-breaking-rules.jpg\"   fetchpriority=\"high\" width=\"2048\" height=\"1075\" alt=\"The AI startup Perplexity fell for a clever trap laid by the social media site Reddit that all but proves it scraped data from the forum.\"\/><\/p>\n<p>\t\t\tIllustration by Tag Hartman-Simkins \/ Futurism. Source: Getty Images<\/p>\n<p class=\"pw-incontent-excluded article-paragraph skip\">Over two decades ago, the New Oxford American Dictionary wanted to see if any of its competitors were cribbing its definitions. So it set up a trap. In its first edition, published in 2001, NOAD included a word called \u201cesquivalience,\u201d which it defined as the \u201cwillful avoidance of one\u2019s official responsibilities.\u201d<\/p>\n<p class=\"article-paragraph skip\">The word was a fake. And <a href=\"https:\/\/www.newyorker.com\/magazine\/2005\/08\/29\/not-a-word\" rel=\"noreferrer nofollow noopener\" target=\"_blank\">the bait worked<\/a>: the word reference website Dictionary.com was caught using \u201cesquivalience,\u201d attributing it to Merriam Webster\u2019s New Millennium. 
Its guilt was undeniable, and the debacle gained considerable media coverage.<\/p>\n<p class=\"article-paragraph skip\">These copyright traps have a name: \u201cmountweazels\u201d \u2014 a term with its <a href=\"https:\/\/www.newyorker.com\/magazine\/2005\/08\/29\/not-a-word\" rel=\"nofollow noreferrer noopener\" target=\"_blank\">own curious history<\/a> \u2014 and an evolution of them is now being used by companies fending off AI data scrapers that devour vast swathes of the internet without asking permission.<\/p>\n<p class=\"article-paragraph skip\">In a lawsuit against four tech companies filed Wednesday and <a href=\"https:\/\/www.nytimes.com\/2025\/10\/22\/technology\/reddit-data-scrapers-perplexity-theft.html\" rel=\"noreferrer nofollow noopener\" target=\"_blank\">covered by The New York Times<\/a>, Reddit revealed how it managed to ensnare the AI startup Perplexity with its own sort of mountweazel. The forum-based social media platform put up a \u201ctest post\u201d on its site that could \u201conly be crawled by Google\u2019s search engine and was not otherwise accessible anywhere on the internet,\u201d it said.<\/p>\n<p class=\"article-paragraph skip\">But within hours, Perplexity\u2019s AI-powered search engine showed the content from the trap Reddit post.<\/p>\n<p class=\"article-paragraph skip\">\u201cPerplexity\u2019s business model is effectively to take Reddit\u2019s content from Google search results,\u201d then feed it into an AI model and \u201ccall it a new product,\u201d Reddit lawyers argued in the suit, per the NYT.<\/p>\n<p class=\"article-paragraph skip\">It\u2019s the latest lawsuit to put the AI industry\u2019s voracious use of scraped data under the spotlight. Training the large language models behind AI products like ChatGPT would not have been possible without free access to an unbelievable wealth of data, much of it copyrighted. 
Reddit itself is trying to cash in on the AI data demand by locking out scrapers and <a href=\"https:\/\/futurism.com\/the-byte\/reddit-ai-deal-posts\" rel=\"nofollow noopener\" target=\"_blank\">selling its user data at a premium<\/a>. It expects to make over $200 million over the next few years <a href=\"https:\/\/www.wired.com\/story\/reddits-sale-user-data-ai-training-draws-ftc-investigation\/\" rel=\"nofollow noreferrer noopener\" target=\"_blank\">through the data licensing venture<\/a>.<\/p>\n<p class=\"article-paragraph skip\">In addition to Perplexity, the Reddit suit targets three more data scraping firms: SerpApi based in Texas; Oxylabs, a Lithuanian startup; and AWMProxy in Russia, which has <a href=\"https:\/\/krebsonsecurity.com\/2022\/06\/the-link-between-awm-proxy-the-glupteba-botnet\/\" rel=\"noreferrer nofollow noopener\" target=\"_blank\">been linked<\/a> to a notorious malware botnet called Glupteba.<\/p>\n<p class=\"article-paragraph skip\">Years before the AI boom, these companies scraped mountains of Google search data to provide search engine optimization services to businesses. Google\u2019s search results were themselves created by scraping websites and then organizing that data. For the most part, this created a mutually beneficial relationship, since scraping helped direct traffic to the websites the data came from through search results, the NYT explains.<\/p>\n<p class=\"article-paragraph skip\">But then these SEO firms started selling their troves of scraped Google data directly to AI companies. 
The AI chatbots that were trained on these data sets don\u2019t direct a meaningful amount of traffic to the websites they get their data from \u2014 if they give accurate attributions at all \u2014 and suddenly, the relationship became one-sided.<\/p>\n<p class=\"article-paragraph skip\">Reddit, which is <a href=\"https:\/\/futurism.com\/artificial-intelligence\/reddit-ai-recommends-heroin\" rel=\"nofollow noopener\" target=\"_blank\">experimenting with its own built-in AI<\/a>, says that Perplexity bought these firms\u2019 scraped data sets, circumventing a cease-and-desist order Reddit sent after it caught Perplexity directly scraping data from its posts without paying for it. The lawsuit noted that citations to Reddit data in Perplexity\u2019s AI search results had jumped \u201cfortyfold,\u201d per the NYT.<\/p>\n<p class=\"article-paragraph skip\">More on AI: <a href=\"https:\/\/futurism.com\/artificial-intelligence\/chatgpt-copyrighted-characters\" rel=\"nofollow noopener\" target=\"_blank\">It\u2019s Still Ludicrously Easy to Generate Copyrighted Characters on ChatGPT<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"Illustration by Tag Hartman-Simkins \/ Futurism. 
Source: Getty Images Over two decades ago, the New Oxford American Dictionary&hellip;\n","protected":false},"author":2,"featured_media":250069,"comment_status":"","ping_status":"","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[45],"tags":[182,181,507,74],"class_list":{"0":"post-250068","1":"post","2":"type-post","3":"status-publish","4":"format-standard","5":"has-post-thumbnail","7":"category-artificial-intelligence","8":"tag-ai","9":"tag-artificial-intelligence","10":"tag-artificialintelligence","11":"tag-technology"},"_links":{"self":[{"href":"https:\/\/www.newsbeep.com\/us\/wp-json\/wp\/v2\/posts\/250068","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.newsbeep.com\/us\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.newsbeep.com\/us\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.newsbeep.com\/us\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/www.newsbeep.com\/us\/wp-json\/wp\/v2\/comments?post=250068"}],"version-history":[{"count":0,"href":"https:\/\/www.newsbeep.com\/us\/wp-json\/wp\/v2\/posts\/250068\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.newsbeep.com\/us\/wp-json\/wp\/v2\/media\/250069"}],"wp:attachment":[{"href":"https:\/\/www.newsbeep.com\/us\/wp-json\/wp\/v2\/media?parent=250068"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.newsbeep.com\/us\/wp-json\/wp\/v2\/categories?post=250068"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.newsbeep.com\/us\/wp-json\/wp\/v2\/tags?post=250068"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}