{"id":237731,"date":"2025-10-24T14:18:12","date_gmt":"2025-10-24T14:18:12","guid":{"rendered":"https:\/\/www.newsbeep.com\/au\/237731\/"},"modified":"2025-10-24T14:18:12","modified_gmt":"2025-10-24T14:18:12","slug":"researchers-show-that-training-on-junk-data-can-lead-to-llm-brain-rot","status":"publish","type":"post","link":"https:\/\/www.newsbeep.com\/au\/237731\/","title":{"rendered":"Researchers show that training on \u201cjunk data\u201d can lead to LLM \u201cbrain rot\u201d"},"content":{"rendered":"<p>On the surface, it seems obvious that training an LLM with \u201chigh quality\u201d data will lead to better performance than feeding it any old \u201clow quality\u201d junk you can find. Now, a group of researchers is attempting to quantify just how much this kind of low-quality data can cause an LLM to experience effects akin to human \u201cbrain rot.\u201d<\/p>\n<p>For <a href=\"https:\/\/llm-brain-rot.github.io\/\" rel=\"nofollow noopener\" target=\"_blank\">a pre-print paper published this month<\/a>, the researchers from Texas A&amp;M, the University of Texas, and Purdue University drew inspiration from existing research showing how humans who consume \u201clarge volumes of trivial and unchallenging online content\u201d can develop problems with attention, memory, and social cognition. That led them to what they\u2019re calling the \u201cLLM brain rot hypothesis,\u201d summed up as the idea that \u201ccontinual pre-training on junk web text induces lasting cognitive decline in LLMs.\u201d<\/p>\n<p>Figuring out what counts as \u201cjunk web text\u201d and what counts as \u201cquality content\u201d is far from a simple or fully objective process, of course. 
But the researchers used a few different metrics to tease a \u201cjunk dataset\u201d and a \u201ccontrol dataset\u201d from <a href=\"https:\/\/huggingface.co\/datasets\/enryu43\/twitter100m_tweets\" rel=\"nofollow noopener\" target=\"_blank\">a corpus of 100 million tweets hosted on Hugging Face<\/a>.<\/p>\n<p>Since brain rot in humans is \u201ca consequence of Internet addiction,\u201d they write, junk tweets should be ones \u201cthat can maximize users\u2019 engagement in a trivial manner.\u201d As such, the researchers created one \u201cjunk\u201d dataset by collecting tweets with high engagement numbers (likes, retweets, replies, and quotes) and shorter lengths, figuring that \u201cmore popular but shorter tweets will be considered to be junk data.\u201d<\/p>\n<p>For a second \u201cjunk\u201d metric, the researchers drew from marketing research to define the \u201csemantic quality\u201d of the tweets themselves. Using a complex GPT-4o prompt, they sought to pull out tweets that focused on \u201csuperficial topics (like conspiracy theories, exaggerated claims, unsupported assertions or superficial lifestyle content)\u201d or that had an \u201cattention-drawing style (such as sensationalized headlines using clickbait language or excessive trigger words).\u201d A random sample of these LLM-based classifications was spot-checked against evaluations from three graduate students with a 76 percent matching rate.<\/p>\n","protected":false},"excerpt":{"rendered":"On the surface, it seems obvious that training an LLM with \u201chigh quality\u201d data will lead to 
better&hellip;\n","protected":false},"author":2,"featured_media":237732,"comment_status":"","ping_status":"","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[20],"tags":[256,254,255,64,63,105],"class_list":{"0":"post-237731","1":"post","2":"type-post","3":"status-publish","4":"format-standard","5":"has-post-thumbnail","7":"category-artificial-intelligence","8":"tag-ai","9":"tag-artificial-intelligence","10":"tag-artificialintelligence","11":"tag-au","12":"tag-australia","13":"tag-technology"},"_links":{"self":[{"href":"https:\/\/www.newsbeep.com\/au\/wp-json\/wp\/v2\/posts\/237731","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.newsbeep.com\/au\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.newsbeep.com\/au\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.newsbeep.com\/au\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/www.newsbeep.com\/au\/wp-json\/wp\/v2\/comments?post=237731"}],"version-history":[{"count":0,"href":"https:\/\/www.newsbeep.com\/au\/wp-json\/wp\/v2\/posts\/237731\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.newsbeep.com\/au\/wp-json\/wp\/v2\/media\/237732"}],"wp:attachment":[{"href":"https:\/\/www.newsbeep.com\/au\/wp-json\/wp\/v2\/media?parent=237731"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.newsbeep.com\/au\/wp-json\/wp\/v2\/categories?post=237731"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.newsbeep.com\/au\/wp-json\/wp\/v2\/tags?post=237731"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}