PCI Passthrough¶
It is possible to discover PCI devices in the Hosts and assign them to Virtual Machines for the KVM hypervisor.
The setup and environment information is taken from here. You can safely ignore all the VGA-related sections for PCI devices that are not graphics cards, or if you don’t want to output a video signal from them.
Warning
The overall setup state was extracted from a preconfigured Fedora 22 machine. Configuration for your distro may be different.
Requirements¶
- The host that is going to be used for virtualization needs to support I/O MMU. For Intel processors this is called VT-d, and for AMD processors it is called AMD-Vi. The instructions below are written for Intel-branded processors, but the process should be very similar for AMD.
- kernel >= 3.12
Machine Configuration (Hypervisor)¶
Kernel Configuration¶
The kernel must be configured to support I/O MMU and to blacklist any driver that could be accessing the PCI devices that we want to use in our VMs. The parameter to enable I/O MMU is:
intel_iommu=on
We also need to tell the kernel to load the vfio-pci
driver and blacklist the drivers for the selected cards. For example, for nvidia GPUs we can use these parameters:
rd.driver.pre=vfio-pci rd.driver.blacklist=nouveau
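For example, on a GRUB-based host these parameters are typically appended to GRUB_CMDLINE_LINUX in /etc/default/grub and the bootloader configuration is regenerated. The exact file and command depend on your distribution; the following is only a sketch for a Fedora/RHEL-like layout:
GRUB_CMDLINE_LINUX="... intel_iommu=on rd.driver.pre=vfio-pci rd.driver.blacklist=nouveau"
# grub2-mkconfig -o /boot/grub2/grub.cfg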
Loading vfio Driver in initrd¶
The modules for vfio must be added to initrd. The list of modules is vfio vfio_iommu_type1 vfio_pci vfio_virqfd
. For example, if your system uses dracut
add the file /etc/dracut.conf.d/local.conf
with this line:
add_drivers+="vfio vfio_iommu_type1 vfio_pci vfio_virqfd"
and regenerate initrd
:
# dracut --force
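If you want to verify that the modules made it into the regenerated image, dracut provides the lsinitrd tool to list its contents:
# lsinitrd | grep vfio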
Driver Blacklisting¶
The same blacklisting done in the kernel parameters must be done in the system configuration. For example, /etc/modprobe.d/blacklist.conf for nvidia GPUs:
blacklist nouveau
blacklist lbm-nouveau
options nouveau modeset=0
alias nouveau off
alias lbm-nouveau off
Alongside this configuration, the vfio driver should be loaded with the IDs of the PCI cards we want to attach to VMs. For example, for the nvidia Grid K2 GPU we pass the id 10de:11bf
. File /etc/modprobe.d/local.conf
:
options vfio-pci ids=10de:11bf
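The vendor:device pair of a card can be obtained with lspci -nn, which prints the numeric IDs in brackets next to the device name. Illustrative (trimmed) output:
# lspci -nn | grep -i nvidia
04:00.0 3D controller [0302]: NVIDIA Corporation GK104GL [GRID K2] [10de:11bf] (rev a1)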
vfio Device Binding¶
I/O MMU separates PCI cards into groups to isolate memory operations between devices and VMs. To add the cards to vfio and assign a group to them, we can use the scripts shared on the aforementioned web page.
This script binds a card to vfio. It goes into /usr/local/bin/vfio-bind
:
#!/bin/sh
# Bind the PCI devices passed as arguments (full PCI addresses) to vfio-pci
modprobe vfio-pci
for dev in "$@"; do
    vendor=$(cat /sys/bus/pci/devices/$dev/vendor)
    device=$(cat /sys/bus/pci/devices/$dev/device)
    if [ -e /sys/bus/pci/devices/$dev/driver ]; then
        echo $dev > /sys/bus/pci/devices/$dev/driver/unbind
    fi
    echo $vendor $device > /sys/bus/pci/drivers/vfio-pci/new_id
done
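The script must be made executable. It can then be tested by hand with a single address before wiring it into systemd, for example:
# chmod +x /usr/local/bin/vfio-bind
# /usr/local/bin/vfio-bind 0000:04:00.0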
The configuration goes into /etc/sysconfig/vfio-bind. The cards are specified with their PCI addresses, which can be retrieved with the lspci command. Make sure to prepend the domain, which is usually 0000. For example:
DEVICES="0000:04:00.0 0000:05:00.0 0000:84:00.0 0000:85:00.0"
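If you are unsure about the domain, lspci -D prints the full address including it. Illustrative output for the example cards:
# lspci -D | grep NVIDIA
0000:04:00.0 3D controller: NVIDIA Corporation GK104GL [GRID K2] (rev a1)
0000:05:00.0 3D controller: NVIDIA Corporation GK104GL [GRID K2] (rev a1)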
Here is a systemd unit that executes the script. It can be written to /etc/systemd/system/vfio-bind.service
and enabled:
[Unit]
Description=Binds devices to vfio-pci
After=syslog.target
[Service]
EnvironmentFile=-/etc/sysconfig/vfio-bind
Type=oneshot
RemainAfterExit=yes
ExecStart=-/usr/local/bin/vfio-bind $DEVICES
[Install]
WantedBy=multi-user.target
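Assuming systemd, the unit can then be enabled and started in the usual way:
# systemctl daemon-reload
# systemctl enable vfio-bind.service
# systemctl start vfio-bind.service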
qemu Configuration¶
Now we need to give qemu access to the vfio devices for the groups assigned to the PCI cards. We can get the list of PCI cards and their I/O MMU groups using this command:
# find /sys/kernel/iommu_groups/ -type l
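The command prints one symlink per device, with the group number in the path. For the example cards the output could look like this (addresses and group numbers will differ on your system):
/sys/kernel/iommu_groups/45/devices/0000:04:00.0
/sys/kernel/iommu_groups/46/devices/0000:05:00.0
/sys/kernel/iommu_groups/58/devices/0000:84:00.0
/sys/kernel/iommu_groups/59/devices/0000:85:00.0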
In our example, our cards are in groups 45, 46, 58 and 59, so we add this configuration to /etc/libvirt/qemu.conf
:
cgroup_device_acl = [
"/dev/null", "/dev/full", "/dev/zero",
"/dev/random", "/dev/urandom",
"/dev/ptmx", "/dev/kvm", "/dev/kqemu",
"/dev/rtc","/dev/hpet", "/dev/vfio/vfio",
"/dev/vfio/45", "/dev/vfio/46", "/dev/vfio/58",
"/dev/vfio/59"
]
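After editing qemu.conf, restart libvirtd so the new device ACLs take effect:
# systemctl restart libvirtd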
Driver Configuration¶
The only configuration needed is the filter for the monitoring probe that gets the list of PCI cards. By default, the probe doesn’t list any cards from a host. To narrow the list, configuration can be changed in /var/lib/one/remotes/etc/im/kvm-probes.d/pci.conf
. The following configuration attributes are available:
Parameter | Description |
---|---|
filter | Filters the list by PCI vendor:device:class patterns (same syntax as for lspci) |
short_address | Filters the list by short PCI address (bus:device.function) |
device_name | Filters the list by device name, with case-insensitive regular expression patterns |
All the filters are applied together to produce the final list of PCI cards.
Example:
# This option specifies the main filters for PCI card monitoring. The format
# is the same as used by lspci to filter on PCI card by vendor:device(:class)
# identification. Several filters can be added as a list, or separated
# by commas. The NULL filter will retrieve all PCI cards.
#
# From lspci help:
# -d [<vendor>]:[<device>][:<class>]
# Show only devices with specified vendor, device and class ID.
# The ID's are given in hexadecimal and may be omitted or given
# as "*", both meaning "any value"#
#
# For example:
# :filter:
# - '10de:*' # all NVIDIA VGA cards
# - '10de:11bf' # only GK104GL [GRID K2]
# - '*:10d3' # only 82574L Gigabit Network cards
# - '8086::0c03' # only Intel USB controllers
#
# or
#
# :filter: '*:*' # all devices
#
# or
#
# :filter: '0:0' # no devices
#
:filter: '*:*'
# The PCI cards list restricted by the :filter option above can be even more
# filtered by the list of exact PCI addresses (bus:device.func).
#
# For example:
# :short_address:
# - '07:00.0'
# - '06:00.0'
#
:short_address:
- '00:1f.3'
# The PCI cards list restricted by the :filter option above can be even more
# filtered by matching the device name against the list of regular expression
# case-insensitive patterns.
#
# For example:
# :device_name:
# - 'Virtual Function'
# - 'Gigabit Network'
# - 'USB.*Host Controller'
# - '^MegaRAID'
#
:device_name:
- 'Ethernet'
- 'Audio Controller'
Usage¶
The basic workflow is to inspect the host information, either in the CLI or in Sunstone, to find out the available PCI devices, and to add the desired device to the template. PCI devices can be added by specifying VENDOR, DEVICE and CLASS, or simply CLASS. Note that OpenNebula will only deploy the VM in a host with the available PCI device. If no hosts match, an error message will appear in the Scheduler log.
CLI¶
A new table in onehost show
command gives us the list of PCI devices per host. For example:
PCI DEVICES
VM ADDR TYPE NAME
00:00.0 8086:0a04:0600 Haswell-ULT DRAM Controller
00:02.0 8086:0a16:0300 Haswell-ULT Integrated Graphics Controller
123 00:03.0 8086:0a0c:0403 Haswell-ULT HD Audio Controller
00:14.0 8086:9c31:0c03 8 Series USB xHCI HC
00:16.0 8086:9c3a:0780 8 Series HECI #0
00:1b.0 8086:9c20:0403 8 Series HD Audio Controller
00:1c.0 8086:9c10:0604 8 Series PCI Express Root Port 1
00:1c.2 8086:9c14:0604 8 Series PCI Express Root Port 3
00:1d.0 8086:9c26:0c03 8 Series USB EHCI #1
00:1f.0 8086:9c43:0601 8 Series LPC Controller
00:1f.2 8086:9c03:0106 8 Series SATA Controller 1 [AHCI mode]
00:1f.3 8086:9c22:0c05 8 Series SMBus Controller
02:00.0 8086:08b1:0280 Wireless 7260
- VM: The VM ID using that specific device. Empty if no VMs are using that device.
- ADDR: PCI Address.
- TYPE: Values describing the device. These are VENDOR:DEVICE:CLASS. These values are used when selecting a PCI device to do passthrough.
- NAME: Name of the PCI device.
To make use of one of the PCI devices in a VM, a new option can be added to the VM template, selecting which device to use. For example, this will ask for a Haswell-ULT HD Audio Controller
:
PCI = [
VENDOR = "8086",
DEVICE = "0a0c",
CLASS = "0403"
]
The device can also be specified without all the type values. For example, to get any PCI Express Root Port, this can be added to a VM template:
PCI = [
CLASS = "0604"
]
More than one PCI option can be added to attach more than one PCI device to the VM, as in the sketch below.
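As an illustrative sketch (the template name, file name and device IDs are only examples), a minimal template requesting the audio controller above plus any PCI Express Root Port could be registered with onetemplate create:
NAME   = "vm-with-pci"
CPU    = 1
MEMORY = 1024
PCI    = [ VENDOR = "8086", DEVICE = "0a0c", CLASS = "0403" ]
PCI    = [ CLASS = "0604" ]
$ onetemplate create vm-with-pci.tmpl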
Sunstone¶
In Sunstone the information is displayed in the PCI tab:
To add a PCI device to a template, select the Other tab:
Usage as Network Interfaces¶
It is possible to use a PCI device as a NIC interface directly in OpenNebula. In order to do so, you will need to follow the configuration steps mentioned in this guide, namely changing the device driver.
When defining a Network that will be used for PCI passthrough NICs, please use either the dummy network driver or the 802.1Q one if you are using VLAN. In either case, type any value into the BRIDGE field; it will be ignored. For 802.1Q you can also leave PHYDEV blank. A minimal network definition is sketched below.
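A minimal sketch of such a network, assuming the dummy driver (the attribute names and addresses below are only illustrative and may vary between OpenNebula versions):
NAME   = "passthrough"
VN_MAD = "dummy"
BRIDGE = "ignored"
AR     = [ TYPE = "IP4", IP = "192.168.100.10", SIZE = "20" ]
It can be registered as usual, for example with onevnet create passthrough.net.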
The context packages support the configuration of the following attributes:
- MAC: It will change the MAC address of the corresponding network interface to the MAC assigned by OpenNebula.
- IP: It will assign an IPv4 address to the interface, assuming a /24 netmask.
- IPV6: It will assign an IPv6 address to the interface, assuming a /128 netmask.
- VLAN_ID: If present, it will create a tagged interface and assign the IPs to the tagged interface.
CLI¶
When a PCI
in a template contains the attribute TYPE="NIC"
, it will be treated as a NIC
and OpenNebula will assign a MAC address, a VLAN_ID, an IP, etc. to the PCI device.
This is an example of the PCI section of an interface that will be treated as a NIC:
PCI=[
NETWORK="passthrough",
NETWORK_UNAME="oneadmin",
TYPE="NIC",
CLASS="0200",
DEVICE="10d3",
VENDOR="8086" ]
Note that the order of appearance of the PCI elements and NIC elements in the template is relevant. They will be mapped to NICs inside the guest in the order they appear, regardless of whether they are NICs or PCIs, as in the sketch below.
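For example, in the following hypothetical fragment the passthrough device becomes the first interface inside the guest and the regular virtual network NIC the second, because the PCI element appears first (the network names are illustrative):
PCI = [ NETWORK = "passthrough", NETWORK_UNAME = "oneadmin", TYPE = "NIC",
        CLASS = "0200", DEVICE = "10d3", VENDOR = "8086" ]
NIC = [ NETWORK = "private" ]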
Sunstone¶
In the Network tab, under advanced options check the PCI Passthrough option and fill in the PCI address. Use the rest of the dialog as usual by selecting a network from the table.