NVIDIA vGPU support
This section describes how to configure the hypervisor to use NVIDIA vGPU features.
BIOS
Check that the following settings are enabled in your BIOS configuration:
Enable SR-IOV
Enable IOMMU
The name and location of the menu options for these features depend on the motherboard manufacturer.
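After rebooting, you can confirm that the IOMMU setting took effect by looking for IOMMU groups in sysfs. The sketch below assumes the standard Linux location /sys/kernel/iommu_groups. The function names are hypothetical. An empty or missing directory usually means that IOMMU is still disabled in the firmware or the kernel. On some Intel systems the kernel also needs the intel_iommu=on boot parameter.

```shell
#!/bin/sh
# Sketch: count the IOMMU groups the kernel exposes in sysfs.
# The directory argument defaults to the standard location and exists
# only so the function can be exercised against a test directory.
iommu_group_count() (
    dir=${1:-/sys/kernel/iommu_groups}
    if [ -d "$dir" ]; then
        # Each IOMMU group is a numbered subdirectory (0, 1, 2, ...).
        find "$dir" -mindepth 1 -maxdepth 1 -type d | wc -l | tr -d ' '
    else
        # No directory at all: the kernel is not using an IOMMU.
        echo 0
    fi
)

# Succeeds (exit status 0) when at least one IOMMU group exists.
iommu_enabled() (
    [ "$(iommu_group_count "$1")" -gt 0 ]
)
```

For example, iommu_enabled && echo "IOMMU active" prints a confirmation on a correctly configured host.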
NVIDIA Drivers
The NVIDIA drivers are proprietary, so you will probably need to download them separately. Check the documentation for your Linux distribution. After installing the drivers and rebooting the server, check that the vGPU VFIO module is loaded and that the GPU information is available:
lsmod | grep vfio
nvidia_vgpu_vfio 57344 0
nvidia-smi
Wed Feb 9 12:36:07 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03 Driver Version: 510.47.03 CUDA Version: N/A |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA A10 On | 00000000:41:00.0 Off | 0 |
| 0% 52C P8 26W / 150W | 0MiB / 23028MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
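If you want to script the module check instead of reading the lsmod output by eye, you can read /proc/modules directly, which is the file lsmod formats. The sketch below assumes the module name nvidia_vgpu_vfio shown in the output above. The function name and its optional file argument are hypothetical and exist only for illustration and testing.

```shell
#!/bin/sh
# Sketch: succeed if a kernel module is currently loaded.
# /proc/modules lists one module per line; the first field is its name.
module_loaded() (
    name=$1
    modules_file=${2:-/proc/modules}
    # Exact match on the first field, so "vfio" does not match "nvidia_vgpu_vfio".
    awk -v m="$name" '$1 == m { found = 1 } END { exit !found }' "$modules_file"
)
```

For example, module_loaded nvidia_vgpu_vfio || echo "vGPU VFIO module missing" warns when the host driver is not active.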
Enable the NVIDIA vGPU
Warning
The following steps assume that your graphics card supports SR-IOV. If it does not, refer to the official NVIDIA documentation to activate vGPU.
Finding the PCI address
lspci | grep NVIDIA
41:00.0 3D controller: NVIDIA Corporation Device 2236 (rev a1)
In this example the address is 41:00.0. Next, convert it to the transformed-bdf format by replacing the colon and the period with underscores; in this case, 41_00_0. You can then obtain the PCI device name and the full information about the NVIDIA GPU (e.g., the maximum number of virtual functions):
virsh nodedev-list --cap pci | grep 41_00_0
pci_0000_41_00_0
virsh nodedev-dumpxml pci_0000_41_00_0
<device>
<name>pci_0000_41_00_0</name>
<path>/sys/devices/pci0000:40/0000:40:03.1/0000:41:00.0</path>
<parent>pci_0000_40_03_1</parent>
<driver>
<name>nvidia</name>
</driver>
<capability type='pci'>
<class>0x030200</class>
<domain>0</domain>
<bus>65</bus>
<slot>0</slot>
<function>0</function>
<product id='0x2236'/>
<vendor id='0x10de'>NVIDIA Corporation</vendor>
<capability type='virt_functions' maxCount='32'/>
<iommuGroup number='44'>
<address domain='0x0000' bus='0x40' slot='0x03' function='0x1'/>
<address domain='0x0000' bus='0x41' slot='0x00' function='0x0'/>
<address domain='0x0000' bus='0x40' slot='0x03' function='0x0'/>
</iommuGroup>
<pci-express>
<link validity='cap' port='0' speed='16' width='16'/>
<link validity='sta' speed='2.5' width='16'/>
</pci-express>
</capability>
</device>
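The address conversion and the lookup of the maximum number of virtual functions can be scripted as follows. This is a sketch under two assumptions: the PCI domain is 0000 unless another one is passed, and the XML uses single-quoted attributes as in the sample output above. All function names are hypothetical.

```shell
#!/bin/sh
# Sketch: convert an lspci address (41:00.0) to transformed-bdf (41_00_0).
bdf_to_transformed() {
    # Map both ':' and '.' to '_'.
    printf '%s\n' "$1" | tr ':.' '__'
}

# Build the libvirt node-device name, e.g. pci_0000_41_00_0.
# The optional second argument overrides the assumed domain 0000.
nodedev_name() (
    domain=${2:-0000}
    printf 'pci_%s_%s\n' "$domain" "$(bdf_to_transformed "$1")"
)

# Read 'virsh nodedev-dumpxml' output on stdin and print the maxCount
# value of the virt_functions capability (empty if not present).
vf_max_count() {
    sed -n "s/.*<capability type='virt_functions' maxCount='\([0-9]*\)'.*/\1/p"
}
```

For example, virsh nodedev-dumpxml "$(nodedev_name 41:00.0)" | vf_max_count prints 32 for the GPU shown above.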
Enabling Virtual Functions
Important
You may need to perform this operation every time you reboot your server.
# /usr/lib/nvidia/sriov-manage -e slot:bus:domain.function
/usr/lib/nvidia/sriov-manage -e 00:41:0000.0
Enabling VFs on 00:41:0000.0
If this command fails, double-check that SR-IOV and IOMMU are enabled in the BIOS. If it succeeds, the virtual functions appear as virtfnN links in the GPU's sysfs directory, similar to this:
ls -l /sys/bus/pci/devices/0000:41:00.0/ | grep virtfn
lrwxrwxrwx 1 root root 0 Feb 9 10:37 virtfn0 -> ../0000:41:00.4
lrwxrwxrwx 1 root root 0 Feb 9 10:37 virtfn1 -> ../0000:41:00.5
lrwxrwxrwx 1 root root 0 Feb 9 10:37 virtfn10 -> ../0000:41:01.6
...
lrwxrwxrwx 1 root root 0 Feb 9 10:37 virtfn30 -> ../0000:41:04.2
lrwxrwxrwx 1 root root 0 Feb 9 10:37 virtfn31 -> ../0000:41:04.3
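Virtual functions may need to be re-enabled after every reboot, so a boot-time helper should only call sriov-manage when no virtfn links exist yet. The sketch below counts those links and reuses the sriov-manage invocation from this section. The function names and the ENABLE_CMD override variable are hypothetical. The optional sysfs root argument is only there for testing.

```shell
#!/bin/sh
# Sketch: count the virtfnN symlinks of a PCI device in sysfs.
vf_count() (
    dev=$1                                  # e.g. 0000:41:00.0
    root=${2:-/sys/bus/pci/devices}
    count=0
    for link in "$root/$dev"/virtfn*; do
        # With no matches the glob stays literal; skip that case.
        [ -e "$link" ] || [ -L "$link" ] || continue
        count=$((count + 1))
    done
    echo "$count"
)

# Enable VFs only if none are present, so repeating it is harmless.
# ENABLE_CMD (hypothetical) defaults to the sriov-manage path used above.
enable_vfs_if_needed() (
    dev=$1                                  # sysfs name, e.g. 0000:41:00.0
    sriov_arg=$2                            # sriov-manage form, e.g. 00:41:0000.0
    if [ "$(vf_count "$dev" "$3")" -gt 0 ]; then
        echo "VFs already enabled on $dev"
    else
        ${ENABLE_CMD:-/usr/lib/nvidia/sriov-manage} -e "$sriov_arg"
    fi
)
```

For example, enable_vfs_if_needed 0000:41:00.0 00:41:0000.0 could be run from a boot script, and vf_count 0000:41:00.0 should then report 32 for the GPU in this example.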
Configuring QEMU
Finally, add the following udev rule so that QEMU can access the VFIO devices:
echo 'SUBSYSTEM=="vfio", GROUP="kvm", MODE="0666"' > /etc/udev/rules.d/opennebula-vfio.rules
# Reload udev rules:
udevadm control --reload-rules && udevadm trigger
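To check that the rule was applied, inspect the mode and group of the VFIO group nodes. This sketch assumes GNU coreutils stat (the -c option) and the kvm group and 0666 mode set by the rule above. The function name is hypothetical.

```shell
#!/bin/sh
# Sketch: print a node's mode and group, and succeed only if they match
# the udev rule (mode 666, group kvm by default).
check_vfio_node() (
    node=$1
    want_group=${2:-kvm}
    mode=$(stat -c '%a' "$node") || exit 1
    group=$(stat -c '%G' "$node") || exit 1
    echo "$node mode=$mode group=$group"
    [ "$mode" = "666" ] && [ "$group" = "$want_group" ]
)
```

For example, for n in /dev/vfio/[0-9]*; do check_vfio_node "$n"; done reports each VFIO group node once the virtual functions are in use.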
Note
For full details, refer to the official NVIDIA vGPU documentation.
Using the vGPU
Once the setup is complete, follow the general steps for adding PCI devices to a VM. For NVIDIA GPUs, note the following:
OpenNebula supports both the legacy mediated device interface and the new vendor-specific interface introduced with Ubuntu 24.04. The virtualization and monitoring drivers handle the vGPU device configuration automatically, and the monitoring process sets the appropriate mode for each device in the MDEV_MODE attribute.
NVIDIA vGPUs can be configured using different profiles, which define the vGPU's characteristics and hardware capabilities. The monitoring process retrieves these profiles from the drivers, so you can select the one that best suits your application's requirements.
The following example shows the monitoring information for an NVIDIA vGPU device:
onehost show -j 13
...
"PCI_DEVICES": {
"PCI": [
{
"ADDRESS": "0000:41:00:4",
"BUS": "41",
"CLASS": "0302",
"CLASS_NAME": "3D controller",
"DEVICE": "2236",
"DEVICE_NAME": "NVIDIA Corporation GA102GL [A10]",
"DOMAIN": "0000",
"FUNCTION": "4",
"MDEV_MODE": "nvidia",
"NUMA_NODE": "-",
"PROFILES": "588 (NVIDIA A10-1B),589 (NVIDIA A10-2B),590 (NVIDIA A10-1Q),591 (NVIDIA A10-2Q),592 (NVIDIA A10-3Q),593 (NVIDIA A10-4Q),594 (NVIDIA A10-6Q),595 (NVIDIA A10-8Q),596 (NVIDIA A10-12Q),597 (NVIDIA A10-24Q),598 (NVIDIA A10-1A),599 (NVIDIA A10-2A),600 (NVIDIA A10-3A),601 (NVIDIA A10-4A),602 (NVIDIA A10-6A),603 (NVIDIA A10-8A),604 (NVIDIA A10-12A),605 (NVIDIA A10-24A)",
"SHORT_ADDRESS": "41:00.4",
"SLOT": "00",
"TYPE": "10de:2236:0302",
"UUID": "e4042b96-e63d-56cf-bcc8-4e6eecccc12e",
"VENDOR": "10de",
"VENDOR_NAME": "NVIDIA Corporation",
"VMID": "-1"
}
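The PROFILES attribute packs every profile into a single "ID (NAME)" list, so splitting it makes it easier to pick one. The following is a sketch that assumes the exact format shown in the sample above. The function names are hypothetical, and the PROFILES string is passed in as an argument, for example copied from the onehost show output.

```shell
#!/bin/sh
# Sketch: turn "588 (NVIDIA A10-1B),589 (NVIDIA A10-2B),..." into
# one "ID<TAB>NAME" line per profile.
list_profiles() {
    printf '%s\n' "$1" | tr ',' '\n' | awk '
        NF == 0 { next }                    # skip empty entries
        {
            id = $1                         # numeric profile ID
            name = $0
            sub(/^[^(]*\(/, "", name)       # drop the leading ID and paren
            sub(/\) *$/, "", name)          # drop the closing paren
            print id "\t" name
        }'
}

# Print the ID of the profile with the given name (empty if not found).
profile_id() {
    list_profiles "$1" | awk -F '\t' -v n="$2" '$2 == n { print $1; exit }'
}
```

With the sample PROFILES value above, profile_id "$PROFILES" 'NVIDIA A10-4Q' prints 593.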
Important
When using NVIDIA cards, make sure the PCI monitoring probe exposes either the physical GPU (for PCI passthrough) or its vGPUs (for SR-IOV), never both in the same configuration.