The computing demands of training AI models may get a lot of the attention from the tech industry — just ask NVIDIA’s shareholders. But the needs posed by AI inference may leave today’s cutting-edge GPUs in the dust.
“If you look at the world of pretraining, it has been kind of monolithic,” said Sid Sheth, founder and CEO of d-Matrix, in this episode of The New Stack Makers. “GPUs have dominated. Specifically, GPUs from one company have dominated the landscape. But as you enter the world of inference, it is not really a one-size-fits-all.
“There are too many different workloads, each workload with very different requirements. … you might be a user who cares all about cost. There might be some user who cares all about interactivity, where I really want to be able to interact with the model. There might be users who don’t really care about either, and just care about throughput.”
With this variety of user profiles, Sheth said, “it’s not like the same hardware or the same computing infrastructure can serve all these needs simultaneously.”
He summed up, “The world of inference is going to be truly heterogeneous, where you will have dedicated, best-in-class hardware to serve specific needs from specific users.”
Keeping Memory and Compute Close
One of the biggest challenges with inference compute, Sheth said, is keeping the memory (which holds the data) as close as possible to the compute. “The reason for that is you are kind of making a lot of trips to memory. When you talk about generative AI workloads, you’re generating content that relies on caching data. So all the previous data gets cached. And every time you generate a new token you are, essentially, tapping into that cache data to figure out what the next best token needs to be.”
With AI agents, “that problem gets multiplied — 10x or 100x. So, the memory footprint becomes very, very, very important, and keeping that memory close to compute becomes very important. The less distance the data has to travel to get to the compute, the faster your inference is going to be. And the more optimal your inference is going to be, the lower cost your inference is going to be.”
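To make that pattern concrete, here is a minimal, hypothetical Python sketch of the key-value caching loop Sheth is describing. The function names, projections and sizes are illustrative only, not d-Matrix’s implementation or any production framework’s API; the point is simply that every newly generated token re-reads the entire cache, so both the cache footprint and the memory traffic it drives grow with each step.

```python
# Toy sketch of KV caching in autoregressive decoding (illustrative only).
# Each generation step appends this token's key/value to the cache, then
# attends over everything cached so far: the cache is re-read on every step.
import numpy as np

D_MODEL = 64  # toy embedding width, not a real model dimension

def toy_attention(query, keys, values):
    """Single-head attention over all cached keys/values."""
    scores = keys @ query / np.sqrt(D_MODEL)   # (seq_len,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ values                    # (D_MODEL,)

def decode_step(token_embedding, kv_cache):
    """One generation step: cache this token's K/V, attend over the cache."""
    k = v = token_embedding                    # stand-in for real K/V projections
    kv_cache["k"].append(k)
    kv_cache["v"].append(v)
    keys = np.stack(kv_cache["k"])             # every step re-reads the
    values = np.stack(kv_cache["v"])           # whole cache from memory
    return toy_attention(token_embedding, keys, values)

cache = {"k": [], "v": []}
for _ in range(1000):                          # generate 1,000 toy tokens
    decode_step(np.random.randn(D_MODEL), cache)

# The footprint grows linearly with generated tokens, and multiplies again
# per concurrent agent: the 10x-100x blow-up Sheth mentions.
bytes_per_token = 2 * D_MODEL * 8              # K + V, float64
print(f"cache after 1,000 tokens: {1000 * bytes_per_token / 1024:.0f} KiB")
```

With realistic model dimensions, thousands of tokens and many agents running at once, that cache quickly dominates memory, which is why keeping it physically close to the compute matters so much for inference speed and cost.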
In this episode, Sheth discussed and showcased d-Matrix’s AI inference platform, Corsair, which takes an innovative approach to architecting and locating memory and compute. d-Matrix builds specialized chiplets, he said, “and then we co-package these chiplets into a fabric, and that gives us that elasticity and modularity in the platform. We can always scale it up or scale it down, depending on the customer’s requirements.”
In Corsair, memory and compute are layered directly on top of each other — like a stack of pancakes — cutting the travel distance down significantly. “The data is sitting inside this memory, and it’s raining down into the compute, which is sitting right underneath it,” Sheth said. “The surface area is much greater when you package things this way. Obviously, there’s more surface area between the memory and the compute, and a lot more data can drop down into the compute.”
Check out the full episode to learn more about inference, why it needs different infrastructure than AI model training, and what Sheth sees ahead for AI infra more generally.