{"id":49077,"date":"2025-08-07T00:15:32","date_gmt":"2025-08-07T00:15:32","guid":{"rendered":"https:\/\/www.newsbeep.com\/au\/49077\/"},"modified":"2025-08-07T00:15:32","modified_gmt":"2025-08-07T00:15:32","slug":"confronting-ais-next-big-challenge-inference-compute","status":"publish","type":"post","link":"https:\/\/www.newsbeep.com\/au\/49077\/","title":{"rendered":"Confronting AI\u2019s Next Big Challenge: Inference Compute"},"content":{"rendered":"<p>The computing demands of training AI models may get a lot of the attention from the tech industry \u2014 just ask <a href=\"https:\/\/thenewstack.io\/nvidia-unveils-next-gen-rubin-and-feynman-architectures-pushing-ai-power-limits\/\" class=\"local-link\" rel=\"nofollow noopener\" target=\"_blank\">NVIDIA\u2019s<\/a> shareholders. But the needs posed by AI inference may leave today\u2019s cutting-edge GPUs in the dust.<\/p>\n<p>\u201cIf you look at the world of pretraining, it has been kind of monolithic,\u201d said <a href=\"https:\/\/www.linkedin.com\/in\/sheth\/\" class=\"ext-link\" rel=\"external  nofollow noopener\" onclick=\"this.target=&#039;_blank&#039;;\" target=\"_blank\">Sid Sheth<\/a>, founder and CEO of <a href=\"https:\/\/www.d-matrix.ai\/\" class=\"ext-link\" rel=\"external  nofollow noopener\" onclick=\"this.target=&#039;_blank&#039;;\" target=\"_blank\">d-Matrix<\/a>, in this episode of The New Stack Makers. \u201c<a href=\"https:\/\/thenewstack.io\/how-cios-can-battle-gpu-poverty-in-the-age-of-ai\/\" class=\"local-link\" rel=\"nofollow noopener\" target=\"_blank\">GPUs have dominated<\/a>. Specifically, <a href=\"https:\/\/thenewstack.io\/nvidia-wants-to-rewrite-the-software-development-stack\/\" class=\"local-link\" rel=\"nofollow noopener\" target=\"_blank\">GPUs from one company<\/a> have dominated the landscape. But as you enter the world of inference, it is not really a one-size-fits-all.<\/p>\n<\/p>\n<p>\u201cThere are too many different workloads, each workload with very different requirements. \u2026 you might be a user who cares all about cost. There might be some user who cares all about interactivity, where I really want to be able to interact with the model. There might be users who don\u2019t really care about either, and just care about throughput.\u201d<\/p>\n<p>With this variety of user profiles, Sheth said, \u201cit\u2019s not like the same hardware or the same computing infrastructure can serve all these needs simultaneously.<\/p>\n<p>He summed up, \u201cThe <a href=\"https:\/\/thenewstack.io\/ai-reliability-engineering-welcome-to-the-third-age-of-sre\/\" class=\"local-link\" rel=\"nofollow noopener\" target=\"_blank\">world of inference<\/a> is going to be truly heterogeneous, where you will have dedicated, best-in-class hardware to serve specific needs from specific users.\u201d<\/p>\n<\/p>\n<p>Keeping Memory and Compute Close<\/p>\n<p>One of the biggest challenges with inference compute, Sheth said, is keeping the memory (which holds the data) as close as possible to the compute. \u201cThe reason for that is you are kind of making a lot of trips to memory. When you talk about generative AI workloads, you\u2019re generating content that relies on caching data. So all the previous data gets cached. And every time you generate a new token you are, essentially, tapping into that cache data to figure out what the next best token needs to be.\u201d<\/p>\n<p>With AI agents, \u201cthat problem gets multiplied \u2014 10x or 100x. 
With AI agents, "that problem gets multiplied 10x or 100x. So, the memory footprint becomes very, very important, and keeping that memory close to compute becomes very important. The less distance the data has to travel to get to the compute, the faster your inference is going to be. And the more optimal your inference is going to be, the lower your inference cost is going to be."

In this episode, Sheth discussed and showcased d-Matrix's AI inference platform, Corsair, which takes an innovative approach to architecting and locating memory and compute. d-Matrix builds specialized chiplets, he said, "and then we co-package these chiplets into a fabric, and that gives us that elasticity and modularity in the platform. We can always scale it up or scale it down, depending on the customer's requirements."

In Corsair, memory and compute are layered directly on top of each other, like a stack of pancakes, cutting the travel distance down significantly. "The data is sitting inside this memory, and it's raining down into the compute, which is sitting right underneath it," Sheth said. "The surface area is much greater when you package things this way. Obviously, there's more surface area between the memory and the compute, and a lot more data can drop down into the compute."
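A rough roofline-style calculation shows why proximity, and the bandwidth it buys, sets the ceiling on decode speed. Every number below is an illustrative assumption, not a d-Matrix or Corsair specification:

```python
# Back-of-envelope: decode throughput when inference is memory-bandwidth-bound.
# All figures are illustrative assumptions (batch size 1), not vendor specs.

weight_bytes = 14e9      # e.g. a 7B-parameter model at 2 bytes/param (assumed)
kv_cache_bytes = 2e9     # cached keys/values for the context so far (assumed)
bandwidth = 1.0e12       # 1 TB/s effective memory bandwidth (assumed)

# Each decoded token streams the weights plus the growing KV cache through
# the compute, so memory bandwidth, not FLOPs, sets the ceiling.
bytes_per_token = weight_bytes + kv_cache_bytes
print(f"~{bandwidth / bytes_per_token:.0f} tokens/s ceiling at 1 TB/s")

# Stacking memory on top of compute raises the effective bandwidth term,
# which lifts this ceiling directly.
print(f"~{3e12 / bytes_per_token:.0f} tokens/s ceiling at an assumed 3 TB/s")
```

Under these assumptions the ceiling roughly triples with the bandwidth, which is the arithmetic behind packaging memory directly above the compute.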
Check out the full episode to learn more about inference, why it needs different infrastructure than AI model training, and what Sheth sees ahead for AI infra more generally.

Heather Joslyn is editor in chief of The New Stack, with a special interest in management and careers issues that are relevant to software developers and engineers. She previously worked as editor in chief of Container Solutions, a Cloud Native consulting…