From a damning new Stanford paper on the illusion of visual understanding in LLMs:
“Frontier models readily generate detailed image descriptions and elaborate reasoning traces, including pathology-biased clinical findings, for images never provided; we term this phenomenon mirage reasoning. Second, without any image input, models also attain strikingly high scores across general and medical multimodal benchmarks, bringing into question their utility and design. In the most extreme case, our model achieved the top rank on a standard chest X-ray question-answering benchmark without access to any images.”
AGI this stuff ain’t.
This study reinforces what Anh Totti Nguyen has been saying for a long time, in a series of underappreciated papers, such as Vision Language Models are Blind, to which I keep trying to draw attention.
Also, regarding the very active discussion on AI and jobs: although some white-collar jobs (e.g., entry-level coder or market research assistant) may be in near-term jeopardy, many of those that require visual understanding (architect, cartographer, civil engineer, film editor, medical illustrator, urban planner, etc.) probably aren’t vulnerable until entirely new techniques are developed.
And humanoid home robots? Don’t make me laugh. If your humanoid robot can’t understand the visual world, it’s just a demo, not something you can trust.
