Alibaba Group Holding has introduced a computing pooling solution that it said led to an 82 per cent cut in the number of Nvidia graphics processing units (GPUs) needed to serve its artificial intelligence models.
The system, called Aegaeon, was beta tested in Alibaba Cloud’s model marketplace for more than three months, where it reduced the number of Nvidia H20 GPUs required to serve dozens of models of up to 72 billion parameters from 1,192 to 213, according to a research paper presented this week at the 31st Symposium on Operating Systems Principles (SOSP) in Seoul, South Korea.
“Aegaeon is the first work to reveal the excessive costs associated with serving concurrent LLM workloads on the market,” the researchers from Peking University and Alibaba Cloud wrote.
Alibaba Cloud is the AI and cloud services unit of Hangzhou-based Alibaba, which owns the Post. Its chief technology officer, Zhou Jingren, is one of the paper’s authors.
Cloud services providers, such as Alibaba Cloud and ByteDance’s Volcano Engine, serve thousands of AI models to users concurrently, meaning that many application programming interface calls are handled at the same time.
However, a small handful of models such as Alibaba’s Qwen and DeepSeek are most popular for inference, with most other models only sporadically called upon. This leads to resource inefficiency, with 17.7 per cent of GPUs allocated to serve only 1.35 per cent of requests in Alibaba Cloud’s marketplace, the researchers found.
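A back-of-the-envelope calculation (using only the figures reported in the paper; the script itself is just illustrative) shows how stark that imbalance is:

```python
# Figures from the Alibaba Cloud marketplace study: 17.7% of GPUs
# were allocated to models that served only 1.35% of requests.
gpu_share = 0.177       # share of GPUs tied up by rarely-called models
request_share = 0.0135  # share of requests those GPUs actually handled

# Requests handled per unit of GPU capacity, relative to the fleet average.
relative_load = request_share / gpu_share
print(f"relative load: {relative_load:.3f}")
# A value well below 1.0 means those GPUs sit largely idle -
# here roughly 13 times less work per GPU than the fleet average.
```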
Researchers globally have sought to improve efficiency by pooling GPU power, allowing one GPU to serve multiple models, for instance.
Under Aegaeon, the researchers designed a solution that performs “auto-scaling” at the token level, meaning that a GPU can switch between serving different models in the middle of generating tokens, or the basic units of data processed by AI systems.
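The paper itself does not publish reference code, but the core idea of token-level auto-scaling can be sketched as a scheduler that interleaves requests for different models on one GPU, switching at token boundaries instead of dedicating the GPU to a single model for an entire response. The class names and the `tokens_per_slice` quantum below are hypothetical, chosen only to illustrate the mechanism:

```python
from collections import deque

class Request:
    """A hypothetical inference request: a target model and a token budget."""
    def __init__(self, model_name, tokens_needed):
        self.model = model_name          # which model this request targets
        self.remaining = tokens_needed   # tokens still to generate
        self.output = []                 # generated tokens (stand-ins here)

def serve_on_one_gpu(requests, tokens_per_slice=4):
    """Interleave token generation for requests against different models.

    `tokens_per_slice` is an assumed scheduling quantum: after producing
    this many tokens, the GPU may switch to a request for another model,
    rather than waiting for the whole response to finish.
    """
    queue = deque(requests)
    loaded_model = None
    switches = 0
    while queue:
        req = queue.popleft()
        if req.model != loaded_model:    # model swap at a token boundary
            loaded_model = req.model
            switches += 1
        for _ in range(min(tokens_per_slice, req.remaining)):
            req.output.append(f"<{req.model}-tok>")  # placeholder token
            req.remaining -= 1
        if req.remaining > 0:
            queue.append(req)            # requeue the unfinished request
    return switches

reqs = [Request("qwen-72b", 6), Request("small-model", 3)]
print(serve_on_one_gpu(reqs))  # number of model switches incurred
```

In this toy version the "model swap" is just a counter; in a real serving system it is the expensive step of moving weights and caches onto the GPU, which is why Aegaeon's reported 97 per cent reduction in switching latency matters.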
The solution allowed a single GPU to serve up to seven models, compared with a maximum of two or three under alternative systems, and cut the latency of switching between models by 97 per cent, the researchers said.
The research showed how Chinese tech firms were optimising software to offset reliance on US chips amid tightening export controls. Photo: Shutterstock
In a post on Thursday, Alibaba Cloud said the solution had been implemented in Bailian, its model marketplace, which serves the company’s own Qwen models to corporate users.
US chipmaking giant Nvidia developed the H20 GPU specifically for the Chinese market in response to Biden-era export controls; the chip is particularly suited to model inference.
However, the chip has since come under investigation from Chinese regulators over alleged backdoor security risks.
At the same time, major Chinese chip designers, such as Huawei Technologies and Cambricon Technologies, have stepped up efforts to develop domestic GPUs in response to Beijing’s self-reliance strategy.
Earlier this month, Nvidia founder and CEO Jensen Huang said the company’s position in China’s advanced chip market had dropped to “zero”.
This article originally appeared in the South China Morning Post (SCMP), the most authoritative voice reporting on China and Asia for more than a century. For more SCMP stories, please explore the SCMP app or visit the SCMP’s Facebook and Twitter pages. Copyright © 2025 South China Morning Post Publishers Ltd. All rights reserved.