NVIDIA Spectrum-X Integration (EE)

This guide provides a high-level overview of the OpenNebula integration with the NVIDIA Spectrum-X™ Ethernet networking platform. This integration allows OpenNebula to act as a single pane of glass for managing an entire AI factory, from compute and storage to the high-performance network fabric.

The integration maps OpenNebula’s logical resource constructs (such as Users and Virtual Networks) directly to the Spectrum-X fabric’s tenant segments. These segments are built on a routed L3EVPN architecture and deliver isolated, high-bandwidth East-West (E/W) traffic for demanding AI and HPC workloads.

The Spectrum-X Platform

NVIDIA Spectrum-X is the first Ethernet fabric built from the ground up to accelerate AI workloads. It delivers advanced performance, scalability, and network intelligence, ensuring consistent, predictable results in a multi-tenant AI cloud.

The platform is built on two key components:

NVIDIA Spectrum-4 Switches: High-bandwidth, low-latency switches that provide RoCE-optimized routing and advanced congestion control.
NVIDIA BlueField-3 SuperNICs: A new class of network adapter that accelerates and secures the network, moving networking and security tasks from the CPU to the DPU.

The fabric uses a routed L3EVPN architecture to create isolated tenant environments. Each tenant is assigned a separate Virtual Routing and Forwarding (VRF) instance on the leaf switches, ensuring traffic from one tenant is logically separated from another.

OpenNebula Integration Concepts

The integration between OpenNebula and Spectrum-X is achieved by creating a clear mapping between OpenNebula’s resource management constructs and the physical network’s tenant architecture.

Resource Mapping

Tenant Mapping: An AI Factory tenant is directly mapped to a User in OpenNebula. This user is then granted access to a specific set of isolated resources (N/S vNet, BlueField-3 PCI Device and GPU PCI Device).
Network Mapping: The integration distinguishes between two traffic patterns:
- North-South (N/S) Network: This is the standard management and external access network for a VM. It is implemented in OpenNebula as a regular Virtual Network (vNet).
- East-West (E/W) Network: This is the high-performance Spectrum-X fabric used for GPU-to-GPU communication.
The link between these two networks is established by storing the tenant E/W VXLAN Network Identifier (VNI) as a custom attribute, SPX_VNI, within the N/S Virtual Network template in OpenNebula. A tenant can attach a VM to its own E/W segment by attaching a specific BlueField-3 PCI device to the VM.
Hardware Access:
- NVIDIA GPUs and BlueField-3 SuperNICs are represented in OpenNebula as PCI Devices.
- Access is granted to tenants by assigning ownership or group access to these PCI devices.
- To enable dynamic E/W fabric configuration, the PCI device template for each SuperNIC must store critical networking information as custom attributes:
  - SPX_NIC_IP: The static IP address of the SuperNIC’s interface. This IP address must remain static due to the routed L3EVPN nature of the E/W fabric.
  - SPX_LEAF_IP: The IP address of the leaf switch the SuperNIC is connected to.
  - SPX_LEAF_PORT: The physical port name on the leaf switch where the SuperNIC is connected.
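
The attribute mapping above can be sketched as a small validation step. This is only an illustration, not the integration's actual code: the attribute names (SPX_VNI, SPX_NIC_IP, SPX_LEAF_IP, SPX_LEAF_PORT) come from this guide, while the dictionary layout, the SuperNic class, and the function names are hypothetical and assume the templates have already been read into plain Python dictionaries.

```python
from dataclasses import dataclass
from ipaddress import ip_address

# Attribute names are taken from this guide; everything else is illustrative.
REQUIRED_NIC_ATTRS = ("SPX_NIC_IP", "SPX_LEAF_IP", "SPX_LEAF_PORT")


@dataclass(frozen=True)
class SuperNic:
    """Hypothetical container for the SPX_* attributes of one SuperNIC."""
    nic_ip: str
    leaf_ip: str
    leaf_port: str


def parse_vni(vnet_template: dict) -> int:
    """Read SPX_VNI from an N/S Virtual Network template (as a dict)."""
    vni = int(vnet_template["SPX_VNI"])
    # A VXLAN VNI is a 24-bit value; 0 is rejected here as a conservative choice.
    if not 1 <= vni <= 2**24 - 1:
        raise ValueError(f"SPX_VNI out of range: {vni}")
    return vni


def parse_supernic(pci_template: dict) -> SuperNic:
    """Validate the SPX_* attributes stored on a SuperNIC PCI device."""
    missing = [key for key in REQUIRED_NIC_ATTRS if not pci_template.get(key)]
    if missing:
        raise ValueError(f"missing attributes: {', '.join(missing)}")
    # ip_address() raises ValueError if the string is not a valid IP address.
    nic_ip = str(ip_address(pci_template["SPX_NIC_IP"]))
    leaf_ip = str(ip_address(pci_template["SPX_LEAF_IP"]))
    return SuperNic(nic_ip, leaf_ip, pci_template["SPX_LEAF_PORT"])


# Example values are made up for illustration.
vnet = {"NAME": "tenant-a-ns", "SPX_VNI": "10100"}
pci = {"SPX_NIC_IP": "10.1.0.2", "SPX_LEAF_IP": "192.0.2.11", "SPX_LEAF_PORT": "swp1"}
tenant_vni = parse_vni(vnet)
tenant_nic = parse_supernic(pci)
```

Checking these values before deployment is one way to catch a missing or malformed attribute early, since the fabric configuration depends on all of them being present.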

Dynamic Fabric Configuration

OpenNebula orchestrates the Spectrum-X fabric dynamically using network hooks. When a user deploys a VM, these hooks run scripts on the hypervisor, and those scripts configure the Spectrum-X leaf switches.

The high-level workflow is as follows:

A tenant instantiates a VM Template containing both a standard N/S network interface and one or more E/W PCI passthrough devices (the BlueField-3 SuperNICs).
The VM’s context contains all the necessary attributes: SPX_VNI (from the N/S vNet) and the SPX_NIC_IP, SPX_LEAF_IP, and SPX_LEAF_PORT attributes (from the PCI devices).
Upon deployment, an OpenNebula network hook runs on the target hypervisor. This hook establishes an SSH connection to the corresponding leaf switches.
The hook uses NVUE commands on the switch to build the tenant E/W datapath, allowing fully tenant-isolated GPU-to-GPU connectivity.
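
The workflow above can be sketched as a planning step that groups a VM's SuperNICs by leaf switch and builds a list of NVUE commands for each leaf. This is a rough sketch under stated assumptions, not the hook shipped with the integration: the guide does not list the exact commands the hook runs, so the NVUE lines below (creating a VRF with an L3 VNI and placing the leaf port in that VRF) are one plausible choice, and the VRF naming scheme and function names are hypothetical. The SSH step is left out; the code only produces the commands that would be sent to each leaf.

```python
from collections import defaultdict


def nvue_commands(vrf: str, vni: int, ports: list[str]) -> list[str]:
    """Build NVUE commands for one leaf switch (assumed command set)."""
    # Map the tenant VRF to its L3 VNI.
    cmds = [f"nv set vrf {vrf} evpn vni {vni}"]
    # Place each leaf port that faces a tenant SuperNIC into the tenant VRF.
    cmds += [f"nv set interface {port} ip vrf {vrf}" for port in ports]
    # Apply the pending configuration.
    cmds.append("nv config apply")
    return cmds


def plan_fabric_changes(vni: int, nics: list[dict]) -> dict[str, list[str]]:
    """Group SuperNICs by leaf switch and return commands per leaf IP."""
    vrf = f"tenant{vni}"  # hypothetical VRF naming scheme
    ports_by_leaf = defaultdict(list)
    for nic in nics:
        ports_by_leaf[nic["SPX_LEAF_IP"]].append(nic["SPX_LEAF_PORT"])
    return {
        leaf_ip: nvue_commands(vrf, vni, ports)
        for leaf_ip, ports in ports_by_leaf.items()
    }


# Example values are made up for illustration.
example_nics = [
    {"SPX_NIC_IP": "10.1.0.2", "SPX_LEAF_IP": "192.0.2.11", "SPX_LEAF_PORT": "swp1"},
    {"SPX_NIC_IP": "10.1.0.6", "SPX_LEAF_IP": "192.0.2.11", "SPX_LEAF_PORT": "swp2"},
    {"SPX_NIC_IP": "10.1.1.2", "SPX_LEAF_IP": "192.0.2.12", "SPX_LEAF_PORT": "swp1"},
]
for leaf_ip, commands in plan_fabric_changes(10100, example_nics).items():
    print(leaf_ip)
    for command in commands:
        print("  " + command)
```

Grouping by leaf means each switch is contacted once per deployment, with all of its port changes applied in a single configuration revision.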

Current Status and Considerations

Important

This is a high-level overview of the integration. Customers interested in a detailed technical discussion and production deployment should contact OpenNebula Systems.

Availability: This integration is part of the OpenNebula Enterprise Edition and is available as a reference implementation.
Validation Environment: The integration has been fully developed and validated in the NVIDIA Air cloud simulation platform, which provides a faithful, large-scale simulation of a Spectrum-X hardware environment.
