As Nvidia ships millions of Grace CPUs and Blackwell AI GPUs to data centers worldwide, the company is hard at work bringing up its next-generation AI and HPC platform, Vera Rubin, which is expected to set a new standard for performance and efficiency. Nvidia’s Vera Rubin comprises not one or two, but nine separate processors, each tailored for a particular workload, creating one of the most complex data center platforms ever.

While Nvidia will disclose more details about Vera Rubin over the coming year before it officially launches in late 2026, let’s recap what we already know about the platform, as the company has revealed a fair few details.

Those nine chips include the Vera CPU; the Rubin GPU; the Rubin CPX GPU with 128 GB of GDDR7; the NVLink 6.0 switch ASIC for scale-up, rack-scale connectivity; the BlueField-4 DPU with an integrated SSD to store key-value cache; ConnectX-9 Photonics Ethernet and Quantum-CX9 1.6 Tb/s Photonics InfiniBand NICs; as well as Spectrum-X Photonics Ethernet and Quantum-X Photonics InfiniBand switching silicon for scale-out connectivity.

Nvidia

(Image credit: Nvidia/YouTube)

A full NVL144 rack integrates 144 Rubin GPU dies (in 72 packages) with 20.7 TB of HBM4 memory and 36 Vera CPUs to deliver up to 3.6 NVFP4 ExaFLOPS for inference and up to 1.2 FP8 ExaFLOPS for training. By comparison, the NVL144 CPX configuration achieves almost 8 NVFP4 ExaFLOPS for inference by adding Rubin CPX accelerators, providing even greater compute density.
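As a quick sanity check, these rack-level numbers line up with simple division; here is a minimal Python sketch using only the figures quoted above (the derived per-package values are estimates, not official specs):

```python
# Back-of-the-envelope math for the Vera Rubin NVL144 rack, using only
# the figures quoted above. Per-package values are derived by division
# and are illustrative, not official Nvidia specs.

gpu_dies = 144
packages = gpu_dies // 2          # two compute dies per package -> 72
hbm4_total_tb = 20.7              # rack-wide HBM4 capacity

rack_nvfp4_exaflops = 3.6         # quoted inference throughput
nvfp4_pflops_per_package = rack_nvfp4_exaflops * 1000 / packages
hbm4_gb_per_package = hbm4_total_tb * 1000 / packages

print(f"~{nvfp4_pflops_per_package:.0f} NVFP4 PFLOPS per package")  # ~50
print(f"~{hbm4_gb_per_package:.0f} GB of HBM4 per package")         # ~288
```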

The Vera CPU packs 88 custom Arm cores (unlike Grace, which uses Arm Neoverse V2 cores) with 2-way simultaneous multithreading, enabling up to 176 threads to run simultaneously. These new Armv9.2 cores, internally called Olympus, rely on a wide out-of-order pipeline and support a wide set of optional extensions (SVE2, crypto, FP8/BF16, tagging, RNG, LS64, etc.). Nvidia’s documents indicate that SMT affects per-thread performance: most pipelines effectively halve per-thread throughput with two threads active, except for a few per-thread-dedicated ones, so developers should decide whether to use SMT for a given workload or keep one thread per core.
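For developers who want to test both configurations, one hedged approach on Linux is to pin a process to one hardware thread per physical core and benchmark against the SMT default; the sketch below assumes a common (but not universal) sibling layout:

```python
import os

# A hedged sketch (Linux-only): restrict a process to one hardware
# thread per physical core, leaving the SMT siblings idle, then
# benchmark against the default where both siblings run. The layout
# assumed here (physical cores 0..87, SMT siblings 88..175) is common
# but not universal; check
# /sys/devices/system/cpu/cpu*/topology/thread_siblings_list
# for the real mapping on a given machine.

NUM_CORES = 88  # Vera: 88 cores, 176 threads with 2-way SMT

wanted = set(range(NUM_CORES))              # one CPU per physical core
allowed = wanted & os.sched_getaffinity(0)  # stay within permitted CPUs
os.sched_setaffinity(0, allowed)            # pin the current process
print(f"Running on {len(os.sched_getaffinity(0))} hardware threads")
```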

Nvidia continues to use its Scalable Coherency Fabric (SCF) within the CPU to tie cores and memory controllers together, but this time, the CPU’s memory bandwidth reaches 1.2 TB/s, 20% higher than Grace. As for system memory, Vera continues to use LPDDR5X, but now uses SOCAMM2 modules for extra density.

Vera uses NVLink-C2C as the coherent CPU to GPU link, the same technology as Grace-Blackwell, but with higher bandwidth. Grace offers 900 GB/s bidirectional bandwidth, but with the Vera Rubin platform, the bandwidth will double to around 1.8 TB/s per CPU.

The recently released images of the Vera processor show that the CPU does not appear to use a monolithic design, but rather a multi-chiplet one, as it has visible internal seams. One image shows a distinct I/O chiplet located next to the Vera CPU. The image also shows green features emanating from the I/O pads of the CPU die; their purpose is unknown. Perhaps some of Vera’s I/O capabilities are enabled by external chiplets beneath the CPU, but this is merely speculation.

Publicly, there are still big gaps in information about Nvidia’s Vera CPU. There are no official clock speeds, per-core cache sizes, exact L2/L3 topology, or TDP. We also have limited information on NUMA/socket configurations outside the NVL144/NVL576 rack context.

The Rubin NVL144 platform is expected to reuse the same rack infrastructure as Blackwell Ultra NVL72, which appears ready to cool 1.8 kW GPUs with minor changes to its cooling system.

Nvidia data center GPU roadmap 2025 showing Rubin and Rubin Ultra

(Image credit: Nvidia)

The follow-on refresh, known as the Rubin Ultra platform, is targeted for 2027 and aims to double performance by moving from two compute chiplets to four, which is expected to raise FP4 inference performance to around 100 PFLOPS per GPU package. The memory capacity of Rubin Ultra will also expand dramatically, reaching 1 TB of HBM4E, delivering approximately 32 TB/s of bandwidth. Such a configuration is projected to consume 3.6 kW, requiring an all-new cooling system for GPU packages and an all-new Kyber rack. In fact, Kyber will pack 576 GPUs in 144 GPU packages, thus significantly expanding Nvidia’s scale-up world size.
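The Rubin Ultra projections follow from simple scaling of the Rubin figures above; a short sketch (the per-package baseline is derived, not official):

```python
# Rough Rubin -> Rubin Ultra scaling using the figures quoted above;
# the per-package Rubin baseline (~50 PFLOPS) is derived, not official.

rubin_dies_per_package = 2
rubin_nvfp4_pflops = 3.6 * 1000 / 72          # NVL144: ~50 PFLOPS/package

ultra_dies_per_package = 4
scale = ultra_dies_per_package / rubin_dies_per_package
ultra_nvfp4_pflops = rubin_nvfp4_pflops * scale  # ~100 PFLOPS/package

kyber_packages = 144
kyber_gpu_dies = kyber_packages * ultra_dies_per_package  # 576 GPUs

print(f"{ultra_nvfp4_pflops:.0f} PFLOPS per Ultra package, "
      f"{kyber_gpu_dies} GPU dies per Kyber rack")
```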

Both Rubin and Rubin Ultra are expected to rely on TSMC’s CoWoS-L advanced packaging technology. Rubin Ultra — which accommodates four near-reticle compute tiles, two I/O dies, and sixteen HBM4E stacks — likely uses a huge interposer, or combines multiple smaller interposers and stitches them together using bridges.

One of the most intriguing parts of the lineup is the Rubin CPX GPU, which will also be available for other platforms if their developers need it.

Nvidia Vera Rubin CPX Dual Rack Solution

(Image credit: Nvidia)

Rubin CPX is a specialized inference accelerator built to handle the context-heavy front end of LLM workloads rather than the high-throughput generation phase. The GPU achieves roughly 30 NVFP4 PFLOPS of performance, which is lower than the flagship Rubin parts but well aligned with the workloads CPX is meant to accelerate.

Instead of the expensive and power-hungry HBM4 stacks used on ‘full’ Rubin GPUs, CPX comes with 128 GB of GDDR7, which is cheaper, runs cooler, and does not require advanced packaging. This design gives CPX a large, cost-efficient memory pool for ingesting and encoding long sequences, from hundreds of thousands to millions of tokens, and for managing multi-modal inputs such as video.
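To see why such a pool matters, consider a rough KV-cache sizing exercise; the sketch below uses a hypothetical transformer configuration (the layer and head counts are assumptions, not any specific model):

```python
# How much KV cache fits in CPX's 128 GB of GDDR7? A sketch with a
# hypothetical transformer config (not any specific Nvidia model).

layers = 80         # hypothetical decoder layers
kv_heads = 8        # grouped-query attention KV heads
head_dim = 128
bytes_per_elem = 1  # FP8 KV cache

# K and V tensors per token, summed across all layers
kv_bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem

gddr7_bytes = 128 * 10**9
max_tokens = gddr7_bytes // kv_bytes_per_token
print(f"{kv_bytes_per_token / 1024:.0f} KiB per token -> "
      f"~{max_tokens / 1e6:.2f}M tokens of KV cache")  # ~0.78M tokens
```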

In the NVL144 CPX system layout, CPX runs alongside standard Rubin GPUs and the Vera CPU to process the prefill/context workloads that Nvidia’s Dynamo software automatically offloads from the main GPUs. This division reduces both cost and power requirements for inference at scale, enabling data centers to deploy large racks with optimal hardware for each stage of a model’s execution.
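Conceptually, the disaggregated flow looks something like the toy sketch below; the class and function names are invented for illustration and do not reflect Dynamo’s actual API:

```python
# Toy sketch of disaggregated inference: context-heavy prefill runs on a
# CPX-style pool, while token-by-token decode runs on HBM-equipped Rubin
# GPUs. All names are invented; this is not Nvidia Dynamo's real API.

class PrefillPool:
    """Stands in for Rubin CPX GPUs: compute-dense, large GDDR7 pool."""
    def prefill(self, prompt_tokens):
        # Would run attention over the full context and build the KV cache.
        return {"kv_cache_for": len(prompt_tokens)}

class DecodePool:
    """Stands in for HBM4-equipped Rubin GPUs: bandwidth-dense decode."""
    def decode(self, kv_cache, max_new_tokens):
        # Would stream the KV cache and generate tokens one at a time.
        return [42] * max_new_tokens

def run_request(prompt_tokens, max_new_tokens, prefill, decode):
    kv = prefill.prefill(prompt_tokens)       # offloaded to the CPX pool
    return decode.decode(kv, max_new_tokens)  # stays on the main GPUs

out = run_request(list(range(1_000_000)), 128, PrefillPool(), DecodePool())
print(len(out), "tokens generated")
```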

The BlueField-4 DPU, meanwhile, offloads infrastructure work such as networking, storage, and security, enabling systems to scale without putting an overwhelming load on CPUs.

Nvidia

(Image credit: Nvidia)

BlueField-4 integrates a 64-core Grace-based CPU, high-performance offload engines, and an 800 Gb/s network interface.

Functionally, BlueField-4 offloads the tasks that normally consume CPU cycles in massive clusters: packet processing, encryption/decryption, virtual switching, routing, telemetry, and storage operations such as NVMe-over-Fabrics or data reduction. By handling these operations in hardware, the DPU reduces latency, increases performance, and ensures that GPUs across many nodes stay fed with data. It also runs Nvidia’s DOCA framework, which provides microservices for security policies, multi-tenant isolation, monitoring, and virtualization.

When NVLink 6.0 is paired with NVSwitch 6.0 inside systems like the NVL144, the full fabric can deliver roughly 28.8 TB/s of aggregate GPU-to-GPU bandwidth.

The 2027 Rubin Ultra refresh will move to NVLink 7.0 and NVSwitch 7.0, keeping the same bidirectional bandwidth per link but likely increasing the port count to around 144 per switch. This expanded switching capacity is what would let Nvidia interconnect the larger four-chiplet Rubin Ultra GPUs at full bandwidth across an entire rack and increase its scale-up world size to 144 GPU packages.

For scale-out networking, Nvidia is moving to photonics-based Ethernet (Spectrum-X) and InfiniBand (Quantum-X) technologies on its Rubin-generation platform and onwards.

Nvidia

(Image credit: Nvidia)

At a high level, both Nvidia’s Spectrum-X Photonics Ethernet and Quantum-X Photonics InfiniBand platforms are based on TSMC’s 1st Generation COUPE platform that integrates a 65nm electronic integrated circuit (EIC) with a photonic integrated circuit (PIC), delivering up to 1.6 Tb/s per port.

Nvidia’s photonics rollout begins with Quantum-X InfiniBand switches that are set to arrive in early 2026 and aim to deliver 115 Tb/s of fabric bandwidth through 144 ports at 800 Gb/s, or eventually 576 ports at 200 Gb/s. These platforms include an integrated processing ASIC capable of 14.4 TFLOPS to accelerate in-network operations via Nvidia’s SHARP v4 protocol and reduce collective-operation latency. All Quantum-X units are designed with liquid cooling to manage their thermal load.

On the Ethernet side, Spectrum-X will ship switches in multiple port-density options: 128×800 Gb/s or 512×200 Gb/s, both offering around 100 Tb/s of aggregate throughput. A larger model increases capacity to 512×800 Gb/s or 2,048×200 Gb/s, reaching 400 Tb/s in total.
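These throughput figures follow directly from port count times per-port rate, as a quick check shows:

```python
# Cross-checking the quoted switch throughput figures: aggregate
# bandwidth is simply port count x per-port rate.

configs = {
    "Quantum-X (144 x 800G)":   (144, 800),
    "Quantum-X (576 x 200G)":   (576, 200),
    "Spectrum-X (128 x 800G)":  (128, 800),
    "Spectrum-X (512 x 200G)":  (512, 200),
    "Spectrum-X (512 x 800G)":  (512, 800),
    "Spectrum-X (2048 x 200G)": (2048, 200),
}

for name, (ports, gbps) in configs.items():
    print(f"{name}: {ports * gbps / 1000:.1f} Tb/s")
# 144x800 and 576x200 both land at 115.2 Tb/s; 128x800 at 102.4 Tb/s
# ("around 100 Tb/s"); 512x800 and 2048x200 at 409.6 Tb/s ("400 Tb/s").
```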

Nvidia

(Image credit: Nvidia)

At the recent GTC event in October, Nvidia introduced the ConnectX-9 Spectrum-X SuperNIC, a next-generation 1.6 Tb/s network interface built specifically for large-scale AI clusters. The unit packs an extremely high-bandwidth SerDes, a 48-lane PCIe 6.0 switch, and fully programmable RDMA capabilities to give GPUs a direct, low-latency path into Spectrum-X Ethernet and Quantum-X InfiniBand fabrics. The card is designed to enable zero-copy GPU-to-network data movement (through GPUDirect Async and NIXL) and to significantly reduce CPU involvement in multi-node inference and training workloads.

Architecturally, ConnectX-9 is the endpoint that ties NVLink-connected racks, such as NVL144 or NVL576, to photonics-enabled scale-out networks, enabling Rubin- and Rubin Ultra-class systems to sustain the massive bandwidth required for multi-rack AI clusters. It also includes a secure on-board platform for isolation and telemetry, and provides the foundation for high-density configurations of up to hundreds of NICs per cluster rack, which will likely become common with NVL144 and NVL576 systems.

We will update this page as we learn more about Nvidia’s upcoming Rubin architecture.
