I have shared many examples of the unreliability of OpenAI’s ChatGPT, Google’s Gemini, and other large language models (LLMs). Recent examples include evaluating a rotated tic-tac-toe grid, and labeling a possum’s body parts. Some correspondents have objected that no one is going to ask an LLM to evaluate an obviously pointless tic-tac-toe rotation or to identify a possum’s tail and snout.
I should have been clearer. The many examples I’ve reported are not just “gotchas” meant to embarrass LLMs. They are intended to demonstrate in a clear and compelling way that LLMs are inherently unreliable.
Going back to my Oxford University Press tetralogy, The AI Delusion, The 9 Pitfalls of Data Science (co-authored with Jay Cordes), The Phantom Pattern Problem, and Distrust: Big Data, Data-Torturing, and the Assault on Science, I have argued relentlessly that AI systems based on data mining are unreliable.
Why LLMs are inherently unreliable
The scientific method begins with a specific hypothesis that is tested with data, ideally a randomized controlled trial. Data mining goes in the other direction, beginning with data and looking to find what have been called “hidden patterns and secret correlations.” When a heretofore unknown pattern or correlation is found, data miners either concoct a plausible theory or argue that theories are not needed. For example, in a Wired piece titled, “The End of Theory: The Data Deluge Makes the Scientific Method Obsolete,” Chris Anderson wrote,
Correlation supersedes causation, and science can advance even without coherent models, unified theories, or really any mechanistic explanation at all.
The problem is that the number of genuinely useful relationships is limited. But the data deluge is increasing the number of coincidental, useless statistical patterns exponentially — so the probability that a data-mined pattern is actually useful is getting ever closer to zero. As I have written elsewhere:
This is the paradox of big data: It would seem that having data for a large number of variables will help us find more reliable patterns; however, the more variables we consider, the less likely it is that what we find will be useful.
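A minimal simulation sketch (pure noise, no real data) makes the arithmetic concrete: with 100 unrelated variables there are 4,950 pairwise correlations to inspect, and roughly 5 percent of them will clear a conventional significance threshold by chance alone.

```python
# Illustrative sketch: every variable is pure noise, yet "statistically
# significant" correlations appear simply because so many pairs are examined.
import numpy as np

rng = np.random.default_rng(0)
n_obs, n_vars = 50, 100                     # 50 observations of 100 unrelated variables
data = rng.normal(size=(n_obs, n_vars))

# With n = 50 observations, |r| > 0.28 is "significant" at the 5% level (two-sided).
critical_r = 0.28

corr = np.corrcoef(data, rowvar=False)      # 100 x 100 correlation matrix
pairs = corr[np.triu_indices(n_vars, k=1)]  # the 4,950 distinct variable pairs
spurious = int(np.sum(np.abs(pairs) > critical_r))

print(f"Pairs examined: {pairs.size}")
print(f"'Significant' correlations among pure noise: {spurious}")
# Expect roughly 0.05 * 4,950 (about 250) such "discoveries", none of them meaningful.
```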
Data-mining algorithms look for statistical patterns with no idea how the data relate to the real world. The data might as well be labeled &4qp#L, M&2V+v, or any other randomly generated collection of ASCII characters. Data-mining algorithms are consequently unable to distinguish between useful causation and useless correlation. Here are a few examples of misplaced faith in data mining:
Admiral Insurance, Britain’s largest car insurance company, planned to launch firstcarquote, which would base its car insurance rates on a data-mined analysis of an applicant’s Facebook posts — before being blocked by Facebook lawyers.
Yongqianbao, a Chinese tech company, reported that they had developed a data-mining algorithm that evaluates loan applications based on applicants’ smartphone usage, including such measures as whether they keep their phones fully charged and how many calls go unanswered. The company is no longer in business.
Gild developed data-mining software for evaluating applicants for software engineering jobs by monitoring their online activities. Even the chief scientist acknowledged that some of the factors chosen by its data-mining software do not make sense, saying, “Obviously, it’s not a causal relationship.” A former employee wrote, “Customers really, really hate the product. There are almost no customers that have an overall positive experience.”
A JP Morgan study concluded that Trump tweets containing the words China, billion, products, Democrats, or great have a “statistically significant impact” on interest rates. They don’t.
Researchers found a correlation between a three-week moving average of Google searches for the word debt and changes in the Dow Jones Industrial Average. Using data for a 7-year period, they reported that their debt strategy had an astounding 23.0% annual return, compared with 2.2% for a buy-and-hold strategy. However, for the 7 years after their study, their debt strategy had an annual return of 2.81%, compared with 8.60% for buy-and-hold.
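The same selection effect can be reproduced with nothing but random numbers. In the sketch below (hypothetical noise “signals” and a noise “market,” no real financial data), the predictor that looks best in-sample turns out to be worthless out of sample.

```python
# Illustrative sketch: data-mine 500 noise "signals" for the one that best
# "predicts" a noise "market" in-sample, then check it out of sample.
import numpy as np

rng = np.random.default_rng(1)
n_in, n_out, n_signals = 150, 150, 500

market = rng.normal(size=n_in + n_out)                # pure-noise "returns"
signals = rng.normal(size=(n_signals, n_in + n_out))  # pure-noise "predictors"

# Select the signal with the largest in-sample correlation with the market.
in_corrs = np.array([np.corrcoef(s[:n_in], market[:n_in])[0, 1] for s in signals])
best = int(np.argmax(np.abs(in_corrs)))

out_corr = np.corrcoef(signals[best][n_in:], market[n_in:])[0, 1]
print(f"Best in-sample correlation : {in_corrs[best]: .2f}")
print(f"Same signal out of sample  : {out_corr: .2f}")
# The in-sample "winner" is just the luckiest of 500 coincidences, so its
# out-of-sample correlation is indistinguishable from zero.
```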
The perils of data mining
LLMs mine data, but they search for statistical patterns in words instead of numbers. Like traditional data-mining algorithms, LLMs do not know how the text they input and output relates to the real world, and consequently they have no way of knowing whether the statements they process and generate are true or false, sensible or nonsense.
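A caricature makes the point. The bigram model sketched below is vastly cruder than any LLM, but it shares the crucial property: each next word is chosen from statistical patterns in the training text, with no check against reality.

```python
# Illustrative sketch: a bigram model picks each next word from word-pair
# frequencies alone. Nothing in the procedure tests the output against the
# world -- only against patterns in the training text.
from collections import Counter, defaultdict
import random

training_text = (
    "the cat sat on the mat . the cat ate the fish . "
    "the moon is made of cheese . the moon orbits the earth ."
)

# Count which word follows which.
follows = defaultdict(Counter)
words = training_text.split()
for w1, w2 in zip(words, words[1:]):
    follows[w1][w2] += 1

def generate(start, length=8, seed=0):
    """Emit up to `length` more words by sampling a likely next word each step."""
    rng = random.Random(seed)
    out = [start]
    for _ in range(length):
        options = follows[out[-1]]
        if not options:
            break
        choices, weights = zip(*options.items())
        out.append(rng.choices(choices, weights=weights)[0])
    return " ".join(out)

print(generate("the"))
# The model can emit "the moon is made of cheese" as fluently as any true
# sentence; it has no way of knowing which statements are false.
```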
Here are a few examples of mishaps caused by this core problem:
Air Canada was sued and lost a case in which a customer-service chatbot generated incorrect information about the airline’s bereavement-fare policy.
A car dealer’s ChatGPT bot agreed to sell a $76,000 Chevy Tahoe for $1. The bot said, “That’s a deal, and that’s a legally binding offer – no takesies backsies,” but the dealer refused to honor the deal.
A New York federal court sanctioned experienced lawyers who used ChatGPT to generate a court filing that cited non-existent cases.
A Greek woman filed for divorce because a ChatGPT reading of photos of coffee grounds left in her cup and her husband’s cup indicated that he was having an affair with a “home wrecker.”
LLM-based hiring systems have been shown to generate outcomes that are biased with respect to race and sex.
A study of the use of OpenAI’s Whisper transcription tool to create records of doctor-patient interactions found that it often inserted entirely made-up phrases and sentences, many of which were potentially harmful.
A man was reportedly hospitalized for severe psychiatric problems after substituting sodium bromide for table salt following a ChatGPT session.
The parents of a 16-year-old who hanged himself filed a wrongful-death lawsuit against OpenAI, alleging that ChatGPT had given their son a “step-by-step playbook” on how to hang himself, including advice on the best knot to use, and had offered to write a suicide note for him.
Scaling up (by increasing the number of parameters, the size of the training data, and the computational power) won’t solve this fundamental weakness — that LLMs are unreliable because they do not know how words relate to the real world.
Why there is no simple fix
Expert training can help LLMs give better answers to the prompts they are trained on, but (1) the experts cannot anticipate every prompt an LLM might be given, (2) many (most?) real-world decisions involve subjective probabilities that cannot be reliably generated by clueless LLMs and cannot be anticipated by human trainers, and (3) following instructions is not general intelligence.
The point of the hundreds of examples I have given over the years is that predictions, advice, or conclusions based solely on statistical patterns are unreliable. Data mining is dodgy because correlation does not supersede causation.