Remember how that infamous Apple reasoning paper, The Illusion of Thinking, showed that LLMs had trouble with the Tower of Hanoi, working fine on versions with small numbers of disks but breaking down altogether with larger numbers of disks?

I wrote it up (quoting Josh Wolfe’s legendary tweet about how “Apple [had] just GaryMarcus’d LLM reasoning ability”) in an essay called A knockout blow for LLMs. If you have forgotten the situation, you can go there for a refresher. It was (and remains) the most popular essay I have posted here.

In part, I concluded:

What the Apple paper shows, most fundamentally, regardless of how you define AGI, is that LLMs are no substitute for good well-specified conventional algorithms. (They also can’t play chess as well as conventional algorithms, can’t fold proteins like special-purpose neurosymbolic hybrids, can’t run databases as well as conventional databases, etc.)

Worse, as the latest Apple paper shows, LLMs may well work on your easy test set (like Hanoi with 4 discs) and seduce you into thinking they have built a proper, generalizable solution when they have not.

At least for the next decade, LLMs (with and without inference-time “reasoning”) will continue to have their uses, …. But anybody who thinks LLMs are a direct route to the sort of AGI that could fundamentally transform society for the good is kidding themselves. This does not mean that the field of neural networks is dead, or that deep learning is dead. LLMs are just one form of deep learning, and maybe others — especially those that play nicer with symbols — will eventually thrive.

At the time many LLM fans were hopping mad, convinced that the Apple paper was biased and must be flawed.

Hundreds of thousands of them (maybe more) circulated a critique of the Apple paper called “The illusion of the illusion” that turned out to be AI-generated and error-ridden, literally written as a joke, in the tradition of the Sokal hoax.

In reality, the replies to the Apple reasoning paper were weak sauce (see also this).

Another paper later last summer, on “the mirage of reasoning,” made similar arguments about chain-of-thought reasoning models, and several other papers, like the recent Stanford taxonomy of LLM reasoning errors, have since extended the point. LLMs have persistent weaknesses around planning, reasoning, and generalization.

Which is a key part of why many of us have long argued for neurosymbolic hybrids, in which neural networks do pattern recognition, and classical AI techniques help out with planning and reasoning.

A new paper from Tufts (released in late February, but just shared with me yesterday) picks up where the Apple reasoning paper left off, and does three interesting things.

First, it (conceptually) replicates and extends the Apple reasoning paper, by showing that a newer variant on LLMs, called VLAs — Vision-Language-Action models, which are increasingly popular in robotics — also suffers from the same generalization problems on the Tower of Hanoi, similarly working fine on small versions of the problem but breaking down as more rings get added. (Humans who master Hanoi develop more robust solutions.)
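For contrast, the Tower of Hanoi is exactly the kind of problem that a short, well-specified conventional algorithm solves for any number of disks. The standard textbook recursion (my illustration here, not anything from the paper) fits in a few lines:

```python
def hanoi(n, source, target, spare):
    """Return the list of moves that transfers n disks from source to target.

    The classic recursion: move the top n-1 disks out of the way onto the
    spare peg, move the largest disk, then move the n-1 disks back on top.
    The same code handles any n.
    """
    if n == 0:
        return []
    return (hanoi(n - 1, source, spare, target)
            + [(source, target)]
            + hanoi(n - 1, source, target, spare))

# The optimal solution always takes 2**n - 1 moves.
print(len(hanoi(3, "A", "C", "B")))   # 7
print(len(hanoi(10, "A", "C", "B")))  # 1023
```

The move count grows as 2^n − 1, but the algorithm itself never changes: the generalization from 3 disks to 10 that the tested models fail to achieve comes for free from the recursion.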

Second, they show that a neurosymbolic hybrid model, which integrates a neural network (for pattern recognition) with a symbolic planner, generalizes far better: “On the 3-block task, the neuro-symbolic model achieves 95% success compared to 34% for the best-performing VLA. The neuro-symbolic model also generalizes to an unseen 4-block variant (78% success), whereas both VLAs fail to complete the task”.
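To make the division of labor concrete, here is a toy sketch of the idea under my own assumptions (this is not the Tufts architecture): the neural side would map pixels to a symbolic description of the block stacks, and the symbolic side would then plan over that description with an ordinary, provably-correct search:

```python
from collections import deque

def moves(state):
    """All legal moves: lift the top block of one pile onto another pile.
    A state is a tuple of piles, each pile a bottom-to-top tuple of blocks."""
    for i, src in enumerate(state):
        if not src:
            continue
        for j in range(len(state)):
            if i == j:
                continue
            nxt = list(map(list, state))
            nxt[j].append(nxt[i].pop())
            yield tuple(map(tuple, nxt))

def plan(start, goal):
    """Tiny symbolic planner: breadth-first search over states.
    Guaranteed to return a shortest sequence of states if one exists."""
    frontier, seen = deque([(start, [])]), {start}
    while frontier:
        state, path = frontier.popleft()
        if state == goal:
            return path
        for nxt in moves(state):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, path + [nxt]))
    return None

# Hypothetical pipeline: a neural perception model would emit the symbolic
# state from camera input; here we simply hard-code what it might output.
perceived = (("A", "B", "C"), (), ())   # one stack of three blocks, two empty spots
goal      = ((), (), ("C", "B", "A"))   # the stack reversed onto the third spot
print(len(plan(perceived, goal)))       # 3
```

The point of the split is that the planner's correctness does not degrade as block counts grow; only the perception step is statistical. That separation is what lets this style of hybrid generalize where an end-to-end model memorizing action sequences does not.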

Third, they show that the neurosymbolic hybrid was vastly more efficient in terms of energy use — by nearly two orders of magnitude. LLMs are an efficient way to do pattern recognition where perfect results are not required, but an inefficient way to reason and plan. Different tools for different jobs. A good hybrid neurosymbolic system is about picking the right constellation of tools for a given job.

You can read the full paper by Timothy Duggan, Pierrick Lorang, Hong Lu, and Matthias Scheutz, which very much vindicates the Apple reasoning paper, here.

§

Still, as I wrote yesterday (and also six years ago, in The Next Decade in AI), taking a neurosymbolic approach is not a panacea. It’s just a starting point.

For example, the specific model here is purpose-built; we would like a general-purpose system that can induce what is needed for the particular occasion; that’s not this. Claude Code (in particular version 4.6), which does seem genuinely impressive to me and to a number of friends I trust, still makes plenty of mistakes, and still can’t be trusted. As one of those friends, deeply positive but still measured, put it, Claude Code is still best treated as a tool, rather than a complete solution. It’s obviously not AGI.

What we have now are lights into the future, not full answers.

But there is more and more reason to think that neurosymbolic AI is a stronger starting point, with potential for better generalization and more efficient performance.

As a society, we could spend another trillion dollars pursuing LLMs, or a trillion dollars seeking new, hybrid approaches. I for one know which I would pick.