
Actors Rupert Grint (L), Emma Watson (M) and Daniel Radcliffe (R) on the set of the film ‘Harry Potter and The Prisoner of Azkaban’, London, England, 2003 (Photo by Murray Close/Getty Images)
When investing in any company or industry — or if you depend on one for your income — it’s important to know the potential risks. Generative artificial intelligence, at least in the form of large language models like ChatGPT, has weaknesses. New research shows a serious one.
A Known AI Industry Risk Example
One known risk in the industry is that vendors have invested in and extended credit to one another, which is essentially building an industry on layered forms of debt.
BNY Mellon estimates that in 2025, such cloud infrastructure providers as Amazon, Alphabet/Google, Meta, Microsoft, and Oracle raised $121 billion in new debt, with more than $90 billion of that raised in the fourth quarter of the year. Credit spreads widened for all the companies, Oracle and Meta most of all, according to the report. Investors are increasingly turning to credit default swaps (which were at the center of some of the massive implosions during the global financial crisis).
BNY Mellon also noted, “UBS analysts foresee as much as $900 billion in new debt from global companies in 2026. Further out, Morgan Stanley and JP Morgan project the technology sector may need to issue as much as $1.5 trillion in new debt over the next few years to finance AI and data center infrastructure construction.”
Copyright Is At The Heart Of What LLMs Do
As important as the financial issues are, they could be an aside when it comes to other business fundamentals, like intellectual property. The LLM vendors have been fighting lawsuits over copyrighted materials they’ve frequently used for training without payment. (The legal issues get complicated. I wrote an article covering some of the basics for the American Bar Association Journal in 2023, if you’d like to read more; it doesn’t include developments since then, and many of the legal questions remain unresolved.)
A defense the industry has used is that the software doesn’t store the original works; supposedly those are set aside after training. Instead, the systems store complex relationships among words from many published pieces, using sophisticated statistical methods to compose an appropriate response to a user. Under this view, retrieving blocks of writing that closely match the originals should be rare.
Research Shows Reproduction
At least, that was the presumption. New research from Ahmed Ahmed, Sanmi Koyejo, and Percy Liang of Stanford University and A. Feder Cooper of both Stanford and Yale University delved into a problem that turned up when the New York Times sued OpenAI (creators of ChatGPT) and Microsoft. One of the points in the complaint was, “Powered by LLMs containing copies of Times content, Defendants’ GenAI tools can generate output that recites Times content verbatim, closely summarizes it, and mimics its expressive style, as demonstrated by scores of examples.”
The researchers found that the ability to reproduce materials extends beyond articles to entire books. “While many believe that LLMs do not memorize much of their training data, recent work shows that substantial amounts of copyrighted text can be extracted from open-weight models,” the abstract said.
A question remained: could full production models allow such reproduction? The researchers found they could. The first step is called a Best-of-N jailbreak, a technique introduced by researchers in 2024 that “works by repeatedly sampling variations of a prompt with a combination of augmentations – such as random shuffling or capitalization for textual prompts – until a harmful response is elicited.”
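The Best-of-N idea can be sketched in a few lines of Python. This is a simplified illustration, not the researchers’ actual code: the `model` and `accepts` callables are placeholders standing in for a real LLM call and a heuristic that checks whether the response matches the target text.

```python
import random

def augment(prompt: str) -> str:
    """Apply random character-level augmentations, as in Best-of-N:
    random capitalization, plus an occasional adjacent-word swap."""
    words = [
        "".join(c.upper() if random.random() < 0.3 else c.lower() for c in w)
        for w in prompt.split()
    ]
    if len(words) > 2 and random.random() < 0.5:
        i = random.randrange(len(words) - 1)
        words[i], words[i + 1] = words[i + 1], words[i]
    return " ".join(words)

def best_of_n(prompt: str, model, accepts, n: int = 100):
    """Sample up to n augmented prompt variants; return the first
    response the acceptance check flags as a success, else None."""
    for _ in range(n):
        response = model(augment(prompt))
        if accepts(response):
            return response
    return None
```

The key point is that nothing about the model changes; the attack simply resamples noisy variants of the same request until one slips past the refusal behavior.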
The new research then follows with “iterative continuation prompts to attempt to extract the book” in question. They tried it with four production LLMs: Claude 3.7 Sonnet, GPT-4.1, Gemini 2.5 Pro, and Grok 3.
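Iterative continuation prompting can also be sketched simply. Again, this is a hypothetical illustration of the loop structure, not the paper’s implementation: `model` is a placeholder for an LLM call, and the prompt wording and stopping thresholds are assumptions.

```python
def extract_text(model, opening: str, max_turns: int = 50, min_new: int = 20):
    """Iterative continuation: repeatedly feed the tail of the text
    recovered so far back to the model and ask for the next passage,
    stopping when a turn adds too little new text."""
    text = opening
    for _ in range(max_turns):
        continuation = model(f"Continue this text exactly:\n{text[-2000:]}")
        if len(continuation.strip()) < min_new:
            break  # model refused or ran out of memorized text
        text += continuation
    return text
```

Each turn hands the model its own previous output as context, so a model that has memorized a book can be walked through it passage by passage.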
Copyrighted Works Don’t Want To Be Free
Gemini 2.5 Pro and Grok 3 didn’t require jailbreaking to extract 76.8% and 70.3%, respectively, of Harry Potter and the Sorcerer’s Stone. Claude 3.7 Sonnet and GPT-4.1 both required jailbreaking. In all, the researchers attempted to extract 13 books: 11 under U.S. copyright and two in the public domain.
The other 10 copyrighted books are Harry Potter and the Goblet of Fire, 1984, The Hobbit, The Catcher in the Rye, A Game of Thrones, Beloved, The Da Vinci Code, The Hunger Games, Catch-22, and The Duchess War. The public domain books are Frankenstein and The Great Gatsby.
Sometimes they could only get sections of a book. “For Claude 3.7 Sonnet, we were able to extract four whole books near-verbatim, including two books under copyright in the U.S.: Harry Potter and the Sorcerer’s Stone and 1984.” They also explicitly said that extraction doesn’t always succeed.
But this undercuts the “we don’t store whole works” narrative, even if entire pieces of writing aren’t stored in a single block. It’s normal for computers to break files into pieces stored in different locations. You may have heard the term defragmentation: reassembling those scattered pieces into contiguous blocks, which frees up space and makes access more efficient. That’s a different mechanism, clearly, but if you can reconstruct the original, have you really not stored it?