Large language models are powerful, but generally they require vast computing resources, which means they typically have to run on stacks of high-end GPUs in data centers. Now, startup Multiverse Computing has created models it says are comparable in size to the brains of chickens and flies—allowing the company to shrink powerful LLMs so that they can run on home appliances, smartphones, or cars.

Multiverse, based in Donostia, Spain, is working at the intersection of two of technology’s most in-vogue fields—AI and quantum computing. The company’s flagship product is a software platform called Singularity, designed to allow nonexperts to work with quantum algorithms, but it has also developed compression technology called CompactifAI for shrinking neural networks.

The software relies on tensor networks—mathematical tools originally developed to simulate quantum systems on classical hardware. But their ability to distill complex multidimensional systems into something more compact and easier to work with also makes them a promising avenue for compressing large AI models.

Multiverse’s Nano Models Shrink AI

Multiverse has now used CompactifAI to create a new family of “nano models” that it calls Model Zoo, with each one named after the animal whose brain (theoretically) has a comparable amount of processing power. The first two releases are a compressed version of Meta’s Llama 3.1 model dubbed ChickenBrain, which can bring reasoning capabilities to a Raspberry Pi, and a version of the open-source model SmolLM2 135M small enough to run on a smartphone, dubbed SuperFly.

“SuperFly is a 94-million-parameter model, which is tiny. It’s definitely one of the smallest LLMs out there,” says Sam Mugel, Multiverse’s chief technology officer. “Any device that’s expensive enough that you could justify putting a Raspberry Pi in would be able to host an LLM like SuperFly.” That means pricier appliances such as washing machines or refrigerators could gain AI capabilities they would otherwise not be able to incorporate.

The company says this could bring AI capabilities to a wide range of appliances, and in particular the ability to control devices using natural language. Being able to run LLMs locally rather than via the cloud has a host of benefits, says Mugel, including significantly reduced latency and fewer security and privacy risks due to data being processed on-device.

They could be particularly useful for applications where Internet connections may be unreliable, Mugel says. SuperFly is small enough to be directly embedded in a car’s dashboard, which could allow uninterrupted natural-language control even while driving through tunnels or in areas with poor network coverage.

Compressing models is standard practice these days, thanks to growing concerns around the energy and hardware footprints of the largest models. Neural networks are surprisingly inefficient learners and contain a lot of redundant information, says Mugel, which leaves a lot of room for optimization.

This is typically done using techniques like quantization, which involves using fewer bits to represent a model’s weights, or pruning, which removes connections in the neural network that aren’t contributing much to performance. But Mugel says Multiverse’s quantum-inspired tensor-network approach can go further than either of these more conventional techniques, and can also be combined with quantization to push compression even further.
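To make the contrast concrete, here is a minimal sketch of quantization on its own, using symmetric 8-bit rounding on a toy weight matrix. This is a generic illustration of the technique, not Multiverse’s pipeline; the matrix and the single-scale scheme are invented for the example.

```python
import numpy as np

# Toy weight matrix standing in for one layer of a model (32-bit floats).
rng = np.random.default_rng(0)
weights = rng.normal(0, 0.02, size=(4, 4)).astype(np.float32)

# Symmetric 8-bit quantization: map every float to an int8 via one shared scale.
scale = np.abs(weights).max() / 127.0
quantized = np.round(weights / scale).astype(np.int8)   # 1 byte per weight
dequantized = quantized.astype(np.float32) * scale      # approximate originals

# Storage drops 4x (fp32 -> int8) at the cost of a small rounding error,
# which is bounded by half the quantization step.
error = np.abs(weights - dequantized).max()
print(error <= scale / 2 + 1e-8)  # → True
```

Pruning, by contrast, would zero out the smallest entries of `weights` entirely rather than storing all of them at lower precision.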

The first step in the process involves scanning a model to see which layers are most suitable for compression. These layers are then reorganized into tensor networks, which retain the most important patterns in the layer’s weights while discarding redundant information that isn’t contributing much to overall performance. Finally, the compressed model goes through a “healing” step where it is briefly retrained on the task at hand.
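For intuition about the reorganization step, a plain truncated SVD is the simplest matrix analogue of a tensor-network factorization: a layer’s weight matrix is replaced by two much smaller factors that keep its dominant patterns and drop the rest. This is a simplified stand-in, not CompactifAI’s actual decomposition, and the matrix sizes and truncation rank below are invented for the sketch; the brief retraining of the factors would correspond to the “healing” step.

```python
import numpy as np

# Build a weight matrix with low "effective rank" plus small noise,
# mimicking the redundancy found in trained layers.
rng = np.random.default_rng(1)
W = rng.normal(size=(256, 8)) @ rng.normal(size=(8, 256))
W += 0.01 * rng.normal(size=(256, 256))

# Factor W and keep only the top singular components (assumed rank).
U, S, Vt = np.linalg.svd(W, full_matrices=False)
rank = 8
A = U[:, :rank] * S[:rank]   # 256 x 8 factor
B = Vt[:rank, :]             # 8 x 256 factor

# Parameter count drops from 256*256 to 2*(256*8): 16x smaller.
compression = W.size / (A.size + B.size)
rel_error = np.linalg.norm(W - A @ B) / np.linalg.norm(W)
print(compression)  # → 16.0
```

Because the discarded components carry little of the matrix’s weight, `rel_error` stays small, which is exactly the bet tensor-network compression makes about redundant layers.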

“We’ve reorganized the neural network a little bit, and we’ve done a procedure that might take it out of the optimal points of the training,” says Mugel. “The healing is analogous to how people, after a really bad accident, may need a little bit of rehabilitation. That doesn’t mean relearning a task from scratch, it just means getting familiar with it again.”

Efficient AI for Smartphones

The company used this process to create its SuperFly model, which is roughly 30 percent smaller than the model it was compressed from. At just 94 million parameters, it is comparable in size to two fruit-fly brains, says Mugel, each of which has roughly 50 million neural connections. When the company’s researchers installed it on an iPhone 14 Pro, it took up only 191 megabytes of disk space and could process a respectable 115 tokens per second.
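As a back-of-envelope consistency check on those numbers, assuming (hypothetically) about 2 bytes per parameter, as with 16-bit weights:

```python
# Rough footprint estimate for a 94-million-parameter model.
params = 94_000_000
bytes_per_param = 2  # assumed 16-bit precision; not confirmed by Multiverse
footprint_mb = params * bytes_per_param / 1_000_000
print(footprint_mb)  # → 188.0, close to the 191 MB reported on the iPhone
```

The small gap between the estimate and the reported figure would be consistent with extra on-disk overhead such as metadata.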

ChickenBrain is considerably larger, at 3.2 billion parameters, which Mugel admits is similar in size to other smaller language models. But this represents a 60 percent reduction from the 8-billion-parameter Llama model it was compressed from. And the team was also able to add reasoning skills to the model despite the significantly reduced footprint, though Multiverse declined to explain how these new capabilities were achieved.

The upshot is that ChickenBrain actually outperforms the model it was compressed from on a range of benchmarks when running on similar hardware, including the language-focused MMLU-Pro, the math-focused MATH-500 and GSM8K, and the general-knowledge-focused GPQA-Diamond.

“What we’re demonstrating is that we can modify Llama 3.1 8B to make it more powerful with a fraction of the size,” says Mugel. “It’s an important step for making AI leaner and more efficient, as well as opening up new domains for AI models at the edge.”

Zenglin Xu, a professor at the Artificial Intelligence Innovation and Incubation Institute at Fudan University, in Shanghai, says that tensor networks are a promising tool for compression and often provide better results than similar techniques that attempt to simplify layers of a neural network. However, it remains unclear how well models compressed in this way can deal with more-complicated reasoning tasks. “Especially for problems with longer inference chains, the performance could be suboptimal compared with other techniques,” adds Xu.

And despite the compression achieved so far, Mugel admits that there’s still a long way to go before today’s frontier models can be squeezed onto edge devices. But he says there’s plenty of scope to improve Multiverse’s compression techniques, and at the same time more-efficient architectures are bringing cutting-edge capabilities to ever smaller models.

“How much more can we squeeze out of 3 billion parameters?” he says. “That’s really hard to say, but I do believe we’re going to see way better performance in the very near future.”
