{"id":98375,"date":"2025-08-27T00:15:17","date_gmt":"2025-08-27T00:15:17","guid":{"rendered":"https:\/\/www.newsbeep.com\/au\/98375\/"},"modified":"2025-08-27T00:15:17","modified_gmt":"2025-08-27T00:15:17","slug":"d-matrix-corsair-in-memory-computing-for-ai-inference-at-hot-chips-2025","status":"publish","type":"post","link":"https:\/\/www.newsbeep.com\/au\/98375\/","title":{"rendered":"d-Matrix Corsair In-Memory Computing For AI Inference at Hot Chips 2025"},"content":{"rendered":"<p>            <a href=\"https:\/\/www.servethehome.com\/wp-content\/uploads\/2025\/08\/24_dMatrix_Bhoja_final-8.jpg\" data-caption=\"Card Level Scaleup: 16 Chiplet Hierarchical All-to-All\" rel=\"nofollow noopener\" target=\"_blank\"><img loading=\"lazy\" decoding=\"async\" width=\"696\" height=\"392\" class=\"entry-thumb td-modal-image\" src=\"https:\/\/www.newsbeep.com\/au\/wp-content\/uploads\/2025\/08\/24_dMatrix_Bhoja_final-8-696x392.jpg\"   alt=\"Card Level Scaleup: 16 Chiplet Hierarchical All-to-All\" title=\"Card Level Scaleup: 16 Chiplet Hierarchical All-to-All\"\/><\/a>Card Level Scaleup: 16 Chiplet Hierarchical All-to-All<\/p>\n<p>The second machine learning presentation of the afternoon comes from d-Matrix. The company specializes in hardware for AI inference, and as of late has been tackling the matter of how to improve inference performance by using in-memory computing. Along those lines, the company is presenting its Corsair in-memory computing chiplet architecture at Hot Chips. As a quick note: we covered\u00a0<a href=\"https:\/\/www.servethehome.com\/d-matrix-pavehawk-brings-3dimc-to-challenge-hbm-for-ai-inference\/\" rel=\"nofollow noopener\" target=\"_blank\">d-Matrix Pavehawk Brings 3DIMC to Challenge HBM for AI Inference<\/a> a few days ago.<\/p>\n<p>Not to be confused with <a href=\"https:\/\/www.corsair.com\" rel=\"nofollow noopener\" target=\"_blank\">that Corsair<\/a>, d-Matrix claims that Corsair is the most efficient inference platform on the market, thanks to its combination of in-memory computing and low-latency interconnects.<\/p>\n<p><img fetchpriority=\"high\" decoding=\"async\" class=\"size-large wp-image-89723\" src=\"https:\/\/www.newsbeep.com\/au\/wp-content\/uploads\/2025\/08\/24_dMatrix_Bhoja_final-2-800x450.jpg\" alt=\"\u201cRethinking\u201d AI Inference\" width=\"696\" height=\"392\"  \/>\u201cRethinking\u201d AI Inference<br \/>\n<img loading=\"lazy\" decoding=\"async\" class=\"size-large wp-image-89724\" src=\"https:\/\/www.newsbeep.com\/au\/wp-content\/uploads\/2025\/08\/24_dMatrix_Bhoja_final-3-800x450.jpg\" alt=\"LLM Token-generation is Memory Bound\" width=\"696\" height=\"392\"  \/>LLM Token-generation is Memory Bound<\/p>\n<p>Each token in an LLM is memory bound. All the weights need to be read. Batching allows for these weight fetches to be amortized.<\/p>\n<p>d-Matrix\u2019s goal is to reach saturation at moderate batch sizes in order to hit specific latency targets.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"size-large wp-image-89725\" src=\"https:\/\/www.newsbeep.com\/au\/wp-content\/uploads\/2025\/08\/24_dMatrix_Bhoja_final-4-800x450.jpg\" alt=\"Voice is latency-critical and even more Memory Bound\" width=\"696\" height=\"392\"  \/>Voice is latency-critical and even more Memory Bound<\/p>\n<p>Real-time voice requires very low latency. 
[Slide: Voice is latency-critical and even more Memory Bound]

Real-time voice requires very low latency, making it a good target for d-Matrix's technology.

[Slide: AI Agents: SLM & Rise of Inference-Time Compute]

AI agents fall into the same boat, with multiple small models being executed to accomplish the desired task.

[Slide: d-Matrix Corsair: Chiplet-based Inference Acceleration Platform]

And here is Corsair, d-Matrix's accelerator. The card carries two chips, each with 4 chiplets, built on TSMC 6nm, with 2GB of SRAM spread across all of the chiplets. This is a PCIe 5.0 x16 card, so it can be easily added to standard servers.

Meanwhile, at the top of the card are bridge connectors to tie together multiple cards.

Each chiplet interfaces with LPDDR5X, for a total of 256GB of LPDDR5X per card.

[Slide: Corsair Chiplet]

And here is how each chiplet is organized into slices. Around the edge are the LPDDR and die-to-die (D2D) connections, as well as 16 lanes of PCIe.

[Slide: Card Level Scaleup: 16 Chiplet Hierarchical All-to-All]

Two cards can be passively bridged together, making for a 16-chiplet cluster with all-to-all connectivity.

[Slide: System and Scaleup Architecture]

8 cards in turn can go into a standard server, such as a Supermicro X14. In this example there are also 4 NIC cards to offer scale-out capabilities.

[Slide: Corsair Key Pillars – Low Latency, Batched Throughput]

Corsair was built for low-latency, batched-throughput inference. It supports block floating point number formats, and its energy efficiency is 38 TOPS per Watt.
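As a rough illustration of what a block floating point format does, the sketch below quantizes values in fixed-size blocks that share one exponent while keeping small integer mantissas. The block size and mantissa width here are assumptions for illustration; the talk does not detail d-Matrix's exact BFP encoding.

```python
import numpy as np

# Minimal sketch of a generic block floating point (BFP) encode/decode:
# values within a block share one exponent and keep small integer mantissas.
# Block size and mantissa width are assumptions, not d-Matrix's actual format.

BLOCK = 16        # assumed block size
MANT_BITS = 8     # assumed signed mantissa width

def bfp_quantize(x: np.ndarray):
    """Quantize a 1-D float array with one shared exponent per block."""
    x = x.reshape(-1, BLOCK)
    # Pick the shared exponent so the largest magnitude in each block fits.
    max_mag = np.abs(x).max(axis=1, keepdims=True)
    exp = np.ceil(np.log2(np.maximum(max_mag, 1e-38))) - (MANT_BITS - 1)
    mant = np.clip(np.round(x / 2.0 ** exp),
                   -(2 ** (MANT_BITS - 1)), 2 ** (MANT_BITS - 1) - 1)
    return mant.astype(np.int8), exp.astype(np.int8)

def bfp_dequantize(mant: np.ndarray, exp: np.ndarray) -> np.ndarray:
    """Reconstruct floats from block mantissas and per-block exponents."""
    return (mant.astype(np.float32) * 2.0 ** exp.astype(np.float32)).reshape(-1)

w = np.random.randn(64).astype(np.float32)
mant, exp = bfp_quantize(w)
err = np.abs(bfp_dequantize(mant, exp) - w).max()
print(f"max abs reconstruction error: {err:.4f}")
```

The appeal for matrix math is that the integer mantissas can feed cheap integer multiply-accumulate arrays while the shared exponents preserve dynamic range.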
[Slide: Corsair Chiplet Built Using Modular Hardware Blocks]

The dispatch engine within each chiplet is based on RISC-V. Each chiplet is split up into 4 quads, with about 1TB/sec of D2D bandwidth.

[Slide: Energy Efficient DIMC Architecture]

Diving deeper, the matrix multiplier inside Corsair can perform a 64×64 matmul with INT8, or a 64×128 matmul with INT4.
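One way to picture that granularity is a larger INT8 GEMM decomposed into 64×64 tile products with wider accumulation. The tiling loop below is a generic sketch, not d-Matrix's actual scheduling or in-memory mapping.

```python
import numpy as np

# Sketch of how a larger INT8 GEMM decomposes into 64x64 tile operations,
# the granularity the Corsair matrix unit works at per the slide. The loop
# nest is a generic illustration, not d-Matrix's scheduler.

TILE = 64

def tiled_matmul_int8(A: np.ndarray, B: np.ndarray) -> np.ndarray:
    """C = A @ B with int8 inputs, accumulated in int32, one 64x64 tile at a time."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2 and M % TILE == N % TILE == K % TILE == 0
    C = np.zeros((M, N), dtype=np.int32)
    for i in range(0, M, TILE):
        for j in range(0, N, TILE):
            for k in range(0, K, TILE):
                # One 64x64 by 64x64 tile product, the unit of work the
                # in-memory compute array would execute.
                C[i:i + TILE, j:j + TILE] += (
                    A[i:i + TILE, k:k + TILE].astype(np.int32)
                    @ B[k:k + TILE, j:j + TILE].astype(np.int32)
                )
    return C

rng = np.random.default_rng(0)
A = rng.integers(-128, 128, size=(128, 256), dtype=np.int8)
B = rng.integers(-128, 128, size=(256, 192), dtype=np.int8)
assert np.array_equal(tiled_matmul_int8(A, B),
                      A.astype(np.int32) @ B.astype(np.int32))
```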
[Slide: Corsair supports 5x Weight Compression]

Corsair also supports FP formats with scale factors, as well as structured sparsity, though the sparsity is only used for compression. Overall, this gets d-Matrix to 5x weight compression.

[Slide: Core Architecture]

All 8 matrix units can be tied together.

[Slide: Dataflow with Block Floating-Point Numerical Formats]

On the dataflow side, partial results are accumulated on the fly and then converted to the desired output format.

[Slide: Memory System: Global Memory, Stash, and LPDDR]

As for memory, there is a stash memory that feeds the cores. Each stash is 6MB, and there are 2 LPDDR channels per chiplet.

[Slide: Scaling Challenges for Large-Model Inference]

When you have high memory bandwidth, collective latency becomes increasingly critical.

[Slide: Corsair Scaleup – Hardware-Software Codesign]

So in order to do a 16-chiplet all-to-all connection, d-Matrix got latency down to 115ns die-to-die. Even going through PCIe switches, they can still hold latency to 650ns.

[Slide: Package Level Scaleup: 4 Chiplet All-to-All]

Another shot of Corsair chiplets on an organic package.

[Slide: Transparent NIC – Ethernet Scale-Out]

And here is the NIC that d-Matrix uses for scale-out fabrics, with 2us of latency.

[Slide: Rack Level Scale-Out: Multi-node and Multi-Rack]

Using this, d-Matrix can rack and stack many servers.

[Slide: Aviator Software: Easy to use and optimized for Corsair]
[Slide: Aviator Software: Codesigned for LLM Acceleration]

And no inference accelerator would be complete without a matching software stack to enable the hardware and its features.

[Slide: Power Efficiency (TOPS/W)]

And here's a look at power consumption: 275W at 800MHz, while 1.2GHz draws 550W. Higher clock speeds are worse for overall efficiency, but not immensely so.
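Those two operating points line up with the "worse, but not immensely so" claim if throughput scales roughly linearly with clock, which is an assumption for illustration rather than a disclosed figure:

```python
# Quick arithmetic behind the efficiency claim. The power figures are from the
# slide; linear throughput scaling with clock is an assumption for illustration.

low  = {"clk_ghz": 0.8, "watts": 275}
high = {"clk_ghz": 1.2, "watts": 550}

throughput_ratio = high["clk_ghz"] / low["clk_ghz"]   # 1.5x more work per second
power_ratio      = high["watts"] / low["watts"]       # 2.0x more power
efficiency_ratio = throughput_ratio / power_ratio     # 0.75x TOPS/W

print(f"throughput {throughput_ratio:.2f}x, power {power_ratio:.2f}x, "
      f"TOPS/W {efficiency_ratio:.2f}x")
# If the quoted 38 TOPS/W corresponds to the 800 MHz point, the 1.2 GHz point
# would land around 38 * 0.75, or roughly 28.5 TOPS/W, under this assumption.
```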
[Slide: Performance and Flexibility for Use Cases]

And here are some Llama3 performance figures. The time per output token is just 2ms, even for the larger Llama3-70B.

[Slide: Stacking of logic on DRAM interposer in 3D]

Underneath the chip, d-Matrix uses a silicon interposer with capacitors for power delivery reliability reasons. d-Matrix goes one step further and 3D stacks DRAM beneath its Corsair chiplets, keeping the local memory very, very close.

[Slide: 3D DRAM Test Vehicle]

And a prototype 3D DRAM test vehicle has already been built, with 36 micron die-to-die stacking. The logic die sits on top, while the DRAM sits underneath.

How does d-Matrix make stacked DRAM plus logic work? By keeping the heat density under 0.3 W/mm², which keeps the DRAM from heating up too much.
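A quick back-of-envelope check using the numbers from earlier in the talk: if most of the card's 275W lands in its 8 compute chiplets and spreads uniformly (a simplification, since the real power breakdown was not given), the 0.3 W/mm² ceiling implies a minimum die area per chiplet.

```python
# Rough sanity check on the 0.3 W/mm^2 heat-density limit for the 3D DRAM stack.
# The 275 W card power and 8 chiplets per card are from the talk; assuming that
# power lands entirely in the compute chiplets and spreads uniformly is a
# simplification for illustration.

CARD_WATTS = 275.0          # card power at 800 MHz (from the slide)
CHIPLETS_PER_CARD = 8       # 2 packages x 4 chiplets
MAX_W_PER_MM2 = 0.3         # heat-density ceiling to avoid overheating the DRAM

watts_per_chiplet = CARD_WATTS / CHIPLETS_PER_CARD
min_area_mm2 = watts_per_chiplet / MAX_W_PER_MM2

print(f"~{watts_per_chiplet:.1f} W per chiplet -> needs >= {min_area_mm2:.0f} mm^2 "
      f"of uniformly heated die area to stay under {MAX_W_PER_MM2} W/mm^2")
# Roughly 34 W per chiplet, which works out to on the order of 115 mm^2.
```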