{"id":230767,"date":"2025-10-17T05:20:20","date_gmt":"2025-10-17T05:20:20","guid":{"rendered":"https:\/\/www.newsbeep.com\/us\/230767\/"},"modified":"2025-10-17T05:20:20","modified_gmt":"2025-10-17T05:20:20","slug":"two-nvidia-dgx-spark-systems-fused-with-m3-ultra-mac-studio-to-deliver-2-8x-gain-in-ai-benchmarks-exo-labs-demonstrates-disaggregated-ai-inference-serving","status":"publish","type":"post","link":"https:\/\/www.newsbeep.com\/us\/230767\/","title":{"rendered":"Two Nvidia DGX Spark systems fused with M3 Ultra Mac Studio to deliver 2.8x gain in AI benchmarks \u2014 EXO Labs demonstrates disaggregated AI inference serving"},"content":{"rendered":"<p id=\"6598b0e9-47e8-4f7c-8d59-bd760c9e9d6d\">EXO Labs&#8217; primary project is EXO, an open-source framework for running large language models (LLMs) efficiently across mixed hardware setups. Rather than treating inference as a task bound to a single GPU or accelerator, EXO automatically spreads workloads across whatever devices you have\u2014turning a cluster of desktops, laptops, workstations, servers, tablets, or even smartphones into a cooperative AI mesh. EXO&#8217;s newest demo combines two of <a data-analytics-id=\"inline-link\" href=\"https:\/\/www.tomshardware.com\/desktops\/mini-pcs\/nvidias-dgx-spark-ai-mini-pc-goes-up-for-sale-october-15-1-petaflop-developer-platform-was-originally-slated-for-may\" target=\"_blank\" data-before-rewrite-localise=\"https:\/\/www.tomshardware.com\/desktops\/mini-pcs\/nvidias-dgx-spark-ai-mini-pc-goes-up-for-sale-october-15-1-petaflop-developer-platform-was-originally-slated-for-may\" rel=\"nofollow noopener\">NVIDIA&#8217;s DGX Spark systems<\/a> with Apple&#8217;s M3 Ultra\u2013powered Mac Studio to make use of the disparate strengths of each machine: Spark has more raw compute muscle, while the Mac Studio can move data around much faster. 
EXO 1.0, currently in early access, blends the two into a single inference pipeline, and it <a data-analytics-id=\"inline-link\" href=\"https:\/\/blog.exolabs.net\/nvidia-dgx-spark\/\" target=\"_blank\" data-url=\"https:\/\/blog.exolabs.net\/nvidia-dgx-spark\/\" referrerpolicy=\"no-referrer-when-downgrade\" data-hl-processed=\"none\" rel=\"nofollow noopener\">apparently works shockingly well<\/a>.<\/p>\n<p class=\"vanilla-image-block\" style=\"padding-top:88.93%;\">\n<p><img decoding=\"async\" src=\"https:\/\/www.newsbeep.com\/us\/wp-content\/uploads\/2025\/10\/NyEnaeZBerJ22jadniX2UG.jpg\" alt=\"Two NVIDIA DGX Spark systems linked up with a Mac Studio using 10-Gigabit Ethernet for disaggregated AI inference.\"   loading=\"lazy\" data-new-v2-image=\"true\" data-original-mos=\"https:\/\/www.newsbeep.com\/us\/wp-content\/uploads\/2025\/10\/NyEnaeZBerJ22jadniX2UG.jpg\" data-pin-media=\"https:\/\/www.newsbeep.com\/us\/wp-content\/uploads\/2025\/10\/NyEnaeZBerJ22jadniX2UG.jpg\" class=\"expandable\"\/><\/p>\n<p>Two NVIDIA DGX Spark systems linked up with a Mac Studio using 10-Gigabit Ethernet for disaggregated AI inference. (Image credit: EXO Labs)<\/p>\n<p id=\"b5403d9d-7218-4c9c-b81b-0b88e04bc37e\" class=\"paywall\" aria-hidden=\"true\">To understand how, you need to know that LLM inference has two distinct phases: first, the prefill stage, when the model reads and processes the user&#8217;s prompt. This part is compute-bound, meaning it benefits from powerful GPUs like the Blackwell part in the DGX Spark. The decode phase follows, generating tokens one by one. 
That step is heavily bandwidth-bound, making the M3 Ultra&#8217;s super-wide memory bus ideal. EXO&#8217;s trick is to divide those phases between machines, streaming the model&#8217;s internal data (called the KV cache) layer by layer so the two systems can work simultaneously instead of waiting on each other.<\/p>\n<p class=\"vanilla-image-block\" style=\"padding-top:42.13%;\">\n<p><img decoding=\"async\" src=\"https:\/\/www.newsbeep.com\/us\/wp-content\/uploads\/2025\/10\/4UkkNbcqDrNH5oneT8vcDM.png\" alt=\"A screenshot of the EXO Labs website showing benchmark results.\"   loading=\"lazy\" data-new-v2-image=\"true\" data-original-mos=\"https:\/\/www.newsbeep.com\/us\/wp-content\/uploads\/2025\/10\/4UkkNbcqDrNH5oneT8vcDM.png\" data-pin-media=\"https:\/\/www.newsbeep.com\/us\/wp-content\/uploads\/2025\/10\/4UkkNbcqDrNH5oneT8vcDM.png\"\/>\n<\/p>\n<p>The performance results from <a href=\"https:\/\/blog.exolabs.net\/nvidia-dgx-spark\/\" target=\"_blank\" data-url=\"https:\/\/blog.exolabs.net\/nvidia-dgx-spark\/\" referrerpolicy=\"no-referrer-when-downgrade\" data-hl-processed=\"none\" rel=\"nofollow noopener\">EXO Labs&#8217; blog post<\/a>. (Image credit: EXO Labs)<\/p>\n<p id=\"d2f300ec-cbdd-45bd-96a1-4f7354bfc130\">In EXO&#8217;s benchmark with Meta&#8217;s Llama-3.1 8B model, the hybrid setup achieved nearly a threefold speedup over the Mac Studio alone \u2014 matching the DGX Spark&#8217;s prefill speed while keeping the M3 Ultra&#8217;s quick generation time. The result is a 2.8\u00d7 overall gain, and that was with an 8K-token prompt on a relatively modest 8B model. 
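<\/p>\n<p>The prefill\/decode split described above can be sketched in miniature. The toy Python below is an illustrative assumption, not EXO&#8217;s actual API: prefill makes one compute-heavy pass over the whole prompt and hands off a per-layer KV cache, while decode re-reads that cache for every generated token, which is why the two phases favor different hardware.<\/p>

```python
# Toy sketch of disaggregated inference: one 'device' handles the
# compute-bound prefill, another the bandwidth-bound decode.
# All names and structures here are illustrative, not EXO's real API.

def prefill(prompt_tokens):
    # Compute-heavy pass over the whole prompt (the DGX Spark's role).
    # Produces a per-layer KV cache, which EXO streams layer by layer.
    n_layers = 4
    return [('kv_layer', layer, tuple(prompt_tokens)) for layer in range(n_layers)]

def decode(kv_cache, n_new_tokens):
    # Token-by-token generation (the Mac Studio's role). Each step
    # re-reads the whole KV cache, so it is memory-bandwidth bound.
    out = []
    for step in range(n_new_tokens):
        _ = [entry for entry in kv_cache]  # stand-in for the cache reads
        out.append('tok%d' % step)
    return out

prompt = [1, 2, 3]
cache = prefill(prompt)    # would run on the compute-dense machine
tokens = decode(cache, 3)  # would run on the high-bandwidth machine
print(tokens)              # -> ['tok0', 'tok1', 'tok2']
```

<p>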
Longer prompts or <a data-analytics-id=\"inline-link\" href=\"https:\/\/www.tomshardware.com\/tech-industry\/artificial-intelligence\/deepseeks-ai-breakthrough-bypasses-industry-standard-cuda-uses-assembly-like-ptx-programming-instead\" target=\"_blank\" data-before-rewrite-localise=\"https:\/\/www.tomshardware.com\/tech-industry\/artificial-intelligence\/deepseeks-ai-breakthrough-bypasses-industry-standard-cuda-uses-assembly-like-ptx-programming-instead\" rel=\"nofollow noopener\">larger models<\/a> should see even greater gains.<\/p>\n<p>This kind of &#8220;disaggregated inference&#8221; isn&#8217;t exactly a novelty, but it&#8217;s still very clever. It hints at a future where AI performance scales not by buying one massive accelerator, but by more intelligently orchestrating the hardware you already have. NVIDIA seems to agree: its <a data-analytics-id=\"inline-link\" href=\"https:\/\/www.tomshardware.com\/tech-industry\/semiconductors\/nvidia-rubin-cpx-forms-one-half-of-new-disaggregated-ai-inference-architecture-approach-splits-work-between-compute-and-bandwidth-optimized-chips-for-best-performance\" target=\"_blank\" data-before-rewrite-localise=\"https:\/\/www.tomshardware.com\/tech-industry\/semiconductors\/nvidia-rubin-cpx-forms-one-half-of-new-disaggregated-ai-inference-architecture-approach-splits-work-between-compute-and-bandwidth-optimized-chips-for-best-performance\" rel=\"nofollow noopener\">upcoming Rubin CPX platform<\/a> will use compute-dense Rubin CPX processors for the context-building prefill stage while standard Rubin chips with huge HBM3e memory bandwidth handle the decode stage\u2014the same principle EXO is already demonstrating on off-the-shelf hardware.<\/p>\n<p class=\"vanilla-image-block\" style=\"padding-top:41.27%;\">\n<p><img decoding=\"async\" src=\"https:\/\/www.newsbeep.com\/us\/wp-content\/uploads\/2025\/10\/Kv7RGTun3RPevczTgCSah.png\" alt=\"A diagram illustrating the difference between traditional 
and disaggregated serving for AI inference.\"   loading=\"lazy\" data-new-v2-image=\"true\" data-original-mos=\"https:\/\/www.newsbeep.com\/us\/wp-content\/uploads\/2025\/10\/Kv7RGTun3RPevczTgCSah.png\" data-pin-media=\"https:\/\/www.newsbeep.com\/us\/wp-content\/uploads\/2025\/10\/Kv7RGTun3RPevczTgCSah.png\"\/>\n<\/p>\n<p>NVIDIA&#8217;s <a href=\"https:\/\/developer.nvidia.com\/blog\/introducing-nvidia-dynamo-a-low-latency-distributed-inference-framework-for-scaling-reasoning-ai-models\/\" target=\"_blank\" data-url=\"https:\/\/developer.nvidia.com\/blog\/introducing-nvidia-dynamo-a-low-latency-distributed-inference-framework-for-scaling-reasoning-ai-models\/\" referrerpolicy=\"no-referrer-when-downgrade\" data-hl-processed=\"none\" rel=\"nofollow noopener\">Dynamo framework<\/a> serves a similar purpose, but without EXO&#8217;s automatic device discovery. (Image credit: NVIDIA)<\/p>\n<p id=\"80fb4225-3c10-4448-ae9a-b0f39479ad62\">EXO&#8217;s early-access release is still experimental. The <a data-analytics-id=\"inline-link\" href=\"https:\/\/github.com\/exo-explore\/exo\" target=\"_blank\" data-url=\"https:\/\/github.com\/exo-explore\/exo\" referrerpolicy=\"no-referrer-when-downgrade\" data-hl-processed=\"none\" rel=\"nofollow noopener\">current open-source version<\/a> (0.0.15-alpha) dates back to March 2025, and the full 1.0 build \u2014 with automated scheduling, KV streaming, and heterogeneous optimizations \u2014 isn&#8217;t public yet. It&#8217;s not plug-and-play consumer software; for now, it&#8217;s a research-grade tool proving that disaggregated inference can deliver real gains.<\/p>\n<p>Even so, it&#8217;s an exciting proof of concept. 
By making intelligent use of mixed hardware, EXO shows that high-performance AI doesn&#8217;t have to be <a data-analytics-id=\"inline-link\" href=\"https:\/\/www.tomshardware.com\/tech-industry\/artificial-intelligence\/groups-including-blackrock-microsoft-nvidia-and-xai-join-forces-to-acquire-aligned-data-centers-usd40b-deal-delivers-5gw-of-operational-and-planned-data-center-capacity\" target=\"_blank\" data-before-rewrite-localise=\"https:\/\/www.tomshardware.com\/tech-industry\/artificial-intelligence\/groups-including-blackrock-microsoft-nvidia-and-xai-join-forces-to-acquire-aligned-data-centers-usd40b-deal-delivers-5gw-of-operational-and-planned-data-center-capacity\" rel=\"nofollow noopener\">monopolized by data centers<\/a>. It&#8217;s enough to make you wonder about the potential of the devices sitting around your office.<\/p>\n","protected":false},"excerpt":{"rendered":"EXO Labs&#8217; primary project is EXO, an open-source framework for running large language models (LLMs) efficiently across 
mixed&hellip;\n","protected":false},"author":2,"featured_media":230768,"comment_status":"","ping_status":"","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[45],"tags":[182,181,507,74],"class_list":{"0":"post-230767","1":"post","2":"type-post","3":"status-publish","4":"format-standard","5":"has-post-thumbnail","7":"category-artificial-intelligence","8":"tag-ai","9":"tag-artificial-intelligence","10":"tag-artificialintelligence","11":"tag-technology"},"_links":{"self":[{"href":"https:\/\/www.newsbeep.com\/us\/wp-json\/wp\/v2\/posts\/230767","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.newsbeep.com\/us\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.newsbeep.com\/us\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.newsbeep.com\/us\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/www.newsbeep.com\/us\/wp-json\/wp\/v2\/comments?post=230767"}],"version-history":[{"count":0,"href":"https:\/\/www.newsbeep.com\/us\/wp-json\/wp\/v2\/posts\/230767\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.newsbeep.com\/us\/wp-json\/wp\/v2\/media\/230768"}],"wp:attachment":[{"href":"https:\/\/www.newsbeep.com\/us\/wp-json\/wp\/v2\/media?parent=230767"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.newsbeep.com\/us\/wp-json\/wp\/v2\/categories?post=230767"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.newsbeep.com\/us\/wp-json\/wp\/v2\/tags?post=230767"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}