Interview In the datacenter biz, power is the product. You either have it or you don’t, Chris Sharp tells El Reg.

The CTO of colocation provider Digital Realty explains that without power, there are no servers, no storage, no GPUs, and none of those AI tokens that have Wall Street in a frenzy. But power isn't just the limiting factor in the US and much of the world; it has also upended the way datacenters are designed and built.

Over the past few years, GPU servers have transitioned from air-cooled machines that required little, if any, additional work to deploy in a typical datacenter to something more reminiscent of the bespoke HPC clusters built by the likes of Cray, Eviden, or Lenovo.

This change didn't take place overnight. Nvidia's Ampere generation of GPUs, introduced in 2020, didn't require a fundamental shift in Digital Realty's approach to cooling or thermal management, Sharp told us.

But in the intervening years the winds were changing. GPU servers were not only growing more power-hungry but were now in high demand, and deploying them was no longer as simple as racking and stacking servers.

The chief constraints: power and density. In the past five years we’ve gone from air-cooled systems that might pull 6 or 7 kilowatts under load to liquid-cooled rack-scale behemoths with more than 120 kW of compute on board.

“More times than not, customers are like, ‘Okay, I broke through, and I’m free of the supply constraint, I have my chips,’ and I have to say: slow down; there’s a lot of other things you’re going to need,” Sharp said.

For a given number of GPUs, you now need so many switches, storage servers, power delivery units, and coolant distribution units. In the case of Nvidia's densest systems, existing datacenters may not even be able to support the physical load.
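
To make that bill-of-materials problem concrete, here's a back-of-the-envelope sizing sketch in Python. Every ratio in it is a hypothetical placeholder chosen for illustration; none are Nvidia or Digital Realty figures.

```python
# Back-of-the-envelope cluster sizing. Every ratio below is a
# hypothetical placeholder for illustration, not a vendor figure.
GPUS_PER_SERVER = 8          # assumed HGX-style box
SERVERS_PER_RACK = 4         # assumed rack layout
GPUS_PER_LEAF_SWITCH = 32    # assumed network radix budget
RACKS_PER_CDU = 2            # assumed coolant distribution ratio
KW_PER_SERVER = 10.0         # assumed draw under load

def size_cluster(num_gpus: int) -> dict:
    """Estimate the supporting gear for a GPU count (illustrative only)."""
    servers = -(-num_gpus // GPUS_PER_SERVER)   # ceiling division
    racks = -(-servers // SERVERS_PER_RACK)
    return {
        "servers": servers,
        "racks": racks,
        "leaf_switches": -(-num_gpus // GPUS_PER_LEAF_SWITCH),
        "cdus": -(-racks // RACKS_PER_CDU),
        "power_kw": servers * KW_PER_SERVER,
    }

print(size_cluster(1024))
# {'servers': 128, 'racks': 32, 'leaf_switches': 32, 'cdus': 16, 'power_kw': 1280.0}
```

The point of the exercise: the GPUs are one line item among many, and every one of the others has its own lead time.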

“Silicon innovation is going to be hampered by the permanence of concrete in the datacenter,” Sharp said. “There’s a potential to have bought your infrastructure, and you do not have a place where it goes.”

A lot of colocation providers can't handle this level of densification, and even those that can aren't necessarily prepared to support the next generation of compute platforms, he adds.

This is a problem for hardware vendors like Nvidia and AMD, who believe that as Moore’s Law slows and advancements in silicon density and energy efficiency become fewer and further between, the best path forward is packing larger and larger chips closer together.

Today, Nvidia's rack systems are hovering around 140 kW in compute capacity. But we've yet to reach a limit. By 2027, Nvidia plans to launch 600 kW racks that pack 576 GPU dies into the space once occupied by just 32.
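
A quick bit of arithmetic on those figures shows why the concrete matters: the same footprint has to carry roughly 18 times the dies and more than four times the power.

```python
# Arithmetic on the figures quoted above, assuming a constant rack
# footprint. The per-die comparison is rough: it ignores that CPUs,
# networking, and power conversion share the rack budget.
dies_now, dies_kyber = 32, 576
kw_now, kw_kyber = 140, 600

print(f"Die density: {dies_kyber / dies_now:.0f}x")   # 18x more dies
print(f"Rack power:  {kw_kyber / kw_now:.1f}x")       # ~4.3x the draw
print(f"kW per die:  {kw_now / dies_now:.2f} -> {kw_kyber / dies_kyber:.2f}")
# 4.38 -> 1.04: each die draws less, but far more are packed in
```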

Nvidia CEO Jensen Huang announced the company's Vera Rubin Ultra platform and Kyber racks at GTC in March partly to move the market forward: the infrastructure required to support large-scale deployments of these racks simply didn't exist yet, and needed to be built.

Sharp is fairly confident that Digital Realty’s existing facilities will be able to accommodate small deployments of Nvidia’s Kyber racks, which he refers to as tuck-ins. Where things get complicated is with larger deployments — those, he says, will require a different kind of datacenter.

To get ahead of this trend toward denser AI deployments, Digital Realty announced a research center in collaboration with Nvidia in October.

The facility, located in Manassas, Virginia, aims to develop a new kind of datacenter, which Huang has taken to calling an AI factory, that consumes power and churns out tokens in return.

The facility will be among the first to feature Nvidia’s Vera Rubin family of GPUs, which are slated to make their debut sometime next year, and will provide Digital Realty the infrastructure necessary to test, validate, and refine new datacenter architectures for power delivery, thermal management, and networking. 

To support this, Digital Realty is also working with Nvidia and its partners on a new software platform called Omniverse DSX, which uses digital twins to simulate gigawatt-scale datacenters and allow for rapid prototyping of modular systems, thermal management systems, networking and power delivery.

While datacenters depend on a steady supply of power, the spikier draw of AI infrastructure also poses a challenge to grid operators. To manage this, the colocation provider is working with Nvidia and startup Emerald AI on a grid-flexible power management system that will let datacenter operators respond proactively to grid conditions.
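
Neither company has published how that system works, but the general shape of grid-responsive curtailment can be sketched in a few lines. The thresholds and the linear ramp below are invented for illustration; they are not Emerald AI's, Nvidia's, or Digital Realty's design.

```python
# Hypothetical sketch of grid-flexible load management: when the grid
# signals stress, cap the cluster's power draw and ramp back up as
# conditions recover. All thresholds are invented for illustration.
NOMINAL_HZ = 60.0      # US grid frequency
STRESS_BAND_HZ = 0.05  # assumed deviation that triggers curtailment

def target_power_fraction(grid_hz: float, floor: float = 0.6) -> float:
    """Fraction of full cluster power to allow, given grid frequency."""
    deviation = abs(grid_hz - NOMINAL_HZ)
    if deviation <= STRESS_BAND_HZ:
        return 1.0  # grid healthy: run flat out
    # Curtail linearly with deviation, but never below the floor the
    # workload can tolerate (enforced in practice via GPU clock caps).
    overshoot = min((deviation - STRESS_BAND_HZ) / STRESS_BAND_HZ, 1.0)
    return max(1.0 - overshoot * (1.0 - floor), floor)

print(target_power_fraction(60.00))  # 1.0   - no curtailment
print(target_power_fraction(59.93))  # ~0.84 - partial curtailment
print(target_power_fraction(59.80))  # 0.6   - at the floor
```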

“What we’re trying to do is certify designs and really create a blueprint for the broader community,” Sharp said. ®