{"id":228584,"date":"2025-10-20T23:18:14","date_gmt":"2025-10-20T23:18:14","guid":{"rendered":"https:\/\/www.newsbeep.com\/au\/228584\/"},"modified":"2025-10-20T23:18:14","modified_gmt":"2025-10-20T23:18:14","slug":"alibaba-cloud-says-it-cut-nvidia-ai-gpu-use-by-82-with-new-pooling-system-up-to-9x-increase-in-output-lets-213-gpus-perform-like-1192","status":"publish","type":"post","link":"https:\/\/www.newsbeep.com\/au\/228584\/","title":{"rendered":"Alibaba Cloud says it cut Nvidia AI GPU use by 82% with new pooling system\u2014 up to 9x increase in output lets 213 GPUs perform like 1,192"},"content":{"rendered":"<p id=\"caf228b5-5290-4a25-9b2c-43d6badcb8b3\">Alibaba Cloud claims its new Aegaeon pooling system reduces the number of Nvidia GPUs required to serve large language models by 82% during a multi-month beta test inside its Model Studio marketplace. The result, published in a <a data-analytics-id=\"inline-link\" href=\"https:\/\/ennanzhai.github.io\/pub\/sosp25-aegaeon.pdf\" target=\"_blank\" rel=\"nofollow noopener\">peer-reviewed paper<\/a> presented at the 2025 ACM Symposium on Operating Systems Principles (SOSP) in Seoul, suggests that cloud providers may be able to extract significantly more inference capacity from existing silicon, especially in constrained markets like China, where the <a data-analytics-id=\"inline-link\" href=\"https:\/\/www.tomshardware.com\/pc-components\/gpus\/china-repurposes-used-nvidia-gpus\" data-before-rewrite-localise=\"https:\/\/www.tomshardware.com\/pc-components\/gpus\/china-repurposes-used-nvidia-gpus\" rel=\"nofollow noopener\" target=\"_blank\">supply of Nvidia&#8217;s latest H20s<\/a> remains limited.<\/p>\n<p>Unlike training-time breakthroughs that chase model quality or speed, Aegaeon is an inference-time scheduler designed to maximize GPU utilization across many models with bursty or unpredictable demand. 
Instead of pinning one accelerator to one model, Aegaeon virtualizes GPU access at the token level, allowing it to schedule tiny slices of work across a shared pool. This means one H20 could serve several different models simultaneously, with system-wide \u201cgoodput\u201d \u2014 a measure of effective output \u2014 rising by as much as nine times compared to older serverless systems.<\/p>\n<p><a id=\"elk-seasonal\"\/><\/p>\n<p id=\"caf228b5-5290-4a25-9b2c-43d6badcb8b3-2\">The system was tested in production over several months, according to the paper, which lists authors from both Peking University and Alibaba\u2019s infrastructure division, including CTO Jingren Zhou. During that window, the number of GPUs needed to support dozens of different LLMs \u2014 ranging in size up to 72 billion parameters \u2014 fell from 1,192 to just 213.<\/p>\n<p>While the paper does not break down which models contributed most to the savings, reporting by the <a data-analytics-id=\"inline-link\" href=\"https:\/\/www.scmp.com\/business\/article\/3329450\/alibaba-cloud-claims-slash-nvidia-gpu-use-82-new-pooling-system?module=top_story&amp;pgtype=section\" rel=\"nofollow noopener\" target=\"_blank\">South China Morning Post<\/a> says the tests were conducted using Nvidia\u2019s H20, one of the <a data-analytics-id=\"inline-link\" href=\"https:\/\/www.tomshardware.com\/tech-industry\/jensen-huang-says-nvidia-china-market-share-has-fallen-to-zero\" data-before-rewrite-localise=\"https:\/\/www.tomshardware.com\/tech-industry\/jensen-huang-says-nvidia-china-market-share-has-fallen-to-zero\" rel=\"nofollow noopener\" target=\"_blank\">few accelerators<\/a> still legally available to Chinese buyers under current U.S. 
export controls.<\/p>\n<p class=\"paywall\" aria-hidden=\"true\">Alibaba says the gains came from two main techniques: Packing multiple models per GPU, and using a token-level autoscaler to dynamically allocate compute as output is generated, rather than reserving resources at the request level. In <a data-analytics-id=\"inline-link\" href=\"https:\/\/www.tomshardware.com\/tag\/benchmark\" data-auto-tag-linker=\"true\" data-before-rewrite-localise=\"https:\/\/www.tomshardware.com\/tag\/benchmark\" rel=\"nofollow noopener\" target=\"_blank\">benchmarks<\/a>, Aegaeon beat the goodput of ServerlessLLM and MuxServe by margins ranging from 1.5 times to 9 times.<\/p>\n<p>Whether those savings translate outside Alibaba\u2019s stack remains to be seen. Alibaba Cloud\u2019s paper does not specify the exact network fabric used in the beta test, but we know the company offers its own eRDMA elastic RDMA network and has a record of building highly\u2011integrated GPU serving stacks, suggesting the results may depend on an optimized, vertically integrated environment.<\/p>\n<p class=\"paywall\" aria-hidden=\"true\">Regardless, the result is likely to attract interest from other hyperscalers looking to stretch scarce accelerator fleets as inference demand continues to spike.<\/p>\n","protected":false},"excerpt":{"rendered":"Alibaba Cloud claims its new Aegaeon pooling system reduces the number of Nvidia GPUs required to serve large&hellip;\n","protected":false},"author":2,"featured_media":228585,"comment_status":"","ping_status":"","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[20],"tags":[256,254,255,64,63,105],"class_list":{"0":"post-228584","1":"post","2":"type-post","3":"status-publish","4":"format-standard","5":"has-post-thumbnail","7":"category-artificial-intelligence","8":"tag-ai","9":"tag-artificial-intelligence","10":"tag-artificialintelligence","11":"tag-au","12":"tag-australia","13":"tag-technology"},"_links":{"self":[{"href":"https:\/\/www.newsbeep.com\/au\/wp-json\/wp\/v2\/posts\/228584","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.newsbeep.com\/au\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.newsbeep.com\/au\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.newsbeep.com\/au\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/www.newsbeep.com\/au\/wp-json\/wp\/v2\/comments?post=228584"}],"version-history":[{"count":0,"href":"https:\/\/www.newsbeep.com\/au\/wp-json\/wp\/v2\/posts\/228584\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.newsbeep.com\/au\/wp-json\/wp\/v2\/media\/228585"}],"wp:attachme
nt":[{"href":"https:\/\/www.newsbeep.com\/au\/wp-json\/wp\/v2\/media?parent=228584"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.newsbeep.com\/au\/wp-json\/wp\/v2\/categories?post=228584"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.newsbeep.com\/au\/wp-json\/wp\/v2\/tags?post=228584"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}