Hi Kenny, I appreciate that you and others might want LLMs to be more capable than they are, especially since they can feel that way at times. But in your comments above and below, you’re pointing to a difference that doesn’t make a difference:
First, I said “At their core, LLMs cannot…” There’s a lot of nuance that my brief, general description naturally doesn’t capture. For instance, you’re right that LLMs can be made more accurate by adding layers of processing as safeguards, such as retrieval-augmented generation (RAG).
Still, those layers and nuances don’t magically transform LLMs into something else. At the end of the day, they still answer your queries by saying things that other people might say in response (and people can say all kinds of wacky things).
In your example above, LLMs are just double-checking their initial output against what those human raters might say.
In your example below (on math), they’re also just double-checking their initial output against what human experts in math might say. Those LLMs still aren’t calculators, because they can still get it wrong.
I’m not saying it’s useless that LLMs answer your queries based on how other people might answer you. For some domains that are well known, e.g., “What is the capital of France?”, what other people might say in response would likely coincide with the fact of the matter. The reason why is that LLMs have been trained on tons of examples of “Paris is the capital of France.” So, LLMs can indirectly model the world by modeling our language.
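To make that “modeling language” point concrete, here’s a toy sketch. It uses plain counting rather than a neural network, so it’s only an illustration of the statistical idea, not how real LLMs are built; the miniature corpus and function names are my own invention:

```python
from collections import Counter

# Toy corpus standing in for web-scale training data.
# Real LLMs learn a neural model over tokens; this toy just counts,
# but the statistical principle is similar: frequent continuations win.
corpus = [
    "Paris is the capital of France",
    "The capital of France is Paris",
    "The capital of France is Paris",
    "The capital of France is Lyon",  # a rare "wacky" voice in the data
]

def next_word_counts(prompt, corpus):
    """Count which words follow `prompt` across the corpus."""
    counts = Counter()
    prompt_words = prompt.lower().split()
    n = len(prompt_words)
    for text in corpus:
        words = text.lower().split()
        for i in range(len(words) - n):
            if words[i:i + n] == prompt_words:
                counts[words[i + n]] += 1
    return counts

counts = next_word_counts("the capital of france is", corpus)
print(counts.most_common())  # [('paris', 2), ('lyon', 1)]
```

Because “Paris” dominates the training examples, the model answers correctly; but notice it never consulted a map or a fact database, only word frequencies. In a sparsely covered domain, the one “Lyon” voice could just as easily win.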
But in less-known domains, they are much more likely to fabricate or hallucinate, including giving a particular voice (or article or paper, etc.) disproportionate weight, just because there aren’t many competing voices. Here are a couple of examples from this year, not from older LLMs.
Anyway, you don’t need to believe me; if you think AI is more capable than I do, just ask it. Here’s what Google’s AI says, and you can get similar results from other models:
Query:
Is this essentially correct?:
“As large language models (LLMs), they model languages, given a large training dataset to draw/remix from. This means they aren’t designed to model the world, or domain knowledge, or whatever. They just model what other people might say in response to your prompt. Sometimes that can be useful, other times it’s not; so, your mileage may vary.”
AI:
Yes, your statement is essentially correct and aligns with the prevailing “stochastic parrot” or “fancy autocomplete” view of Large Language Models (LLMs).
It accurately highlights that LLMs are trained to predict the next word (token) in a sequence based on statistical patterns in massive datasets, not to “understand” the real world, possess domain knowledge, or reason logically in the human sense.
Here is a breakdown of your points based on current understanding…
Feel free to read the rest of the answer as you like.
Of course, there’s nuance in all this, as I was giving a top-level description of LLMs; but those nuances (like the differences you picked out) don’t really make a difference to that general description…