Commit 83c81bae authored by Jim Harris's avatar Jim Harris

doc/vhost: updated user guide



Change-Id: Ie2d9a949d44a2f50523736ade83dfd66935a6385
Signed-off-by: Jim Harris <james.r.harris@intel.com>
Signed-off-by: Dariusz Stojaczyk <dariuszx.stojaczyk@intel.com>
Reviewed-on: https://review.gerrithub.io/391603


Tested-by: SPDK Automated Test System <sys_sgsw@intel.com>
Reviewed-by: Ben Walker <benjamin.walker@intel.com>
parent 4d48d87a

# vhost Users Guide {#vhost_users_guide}

# Table of Contents {#vhost_toc}

- @ref vhost_intro
- @ref vhost_prereqs
- @ref vhost_start
- @ref vhost_config
- @ref vhost_qemu_config
- @ref vhost_example
- @ref vhost_advanced_topics
- @ref vhost_bugs

# Introduction {#vhost_intro}

A vhost target provides a local storage service as a process running on a local machine.
It is capable of exposing virtualized block devices to QEMU instances or other arbitrary
processes. These processes communicate with the vhost target using the
[virtio protocol](https://wiki.libvirt.org/page/Virtio), a standardized protocol for
paravirtualized devices.

SPDK provides an accelerated vhost target by applying the same user space and polling
techniques as other components in SPDK.  Since SPDK is polling for virtio submissions,
it can signal the virtio driver to skip notifications on submission.  This avoids VMEXITs on I/O
submission and can significantly reduce CPU usage in the guest VM on heavy I/O workloads.

The following diagram presents how a QEMU-based VM communicates with an SPDK vhost device.

    +-------QEMU-VM--------+             +---------------SPDK-vhost-------------+
    |                      |             |                                      |
    |  +----------------+  |             |  +--------------------------------+  |
    |  |                |  |             |  |                                |  |
    |  |  Virtio-SCSI   |  |  eventfd    |  |               +-------------+  |  |
    |  |  Linux driver  |  |  interrupt  |  |  Virtio-SCSI  |             |  |  |
    |  |                |  <----------------+  device       |  NVMe disk  |  |  |
    |  +--------^-------+  |             |  |               |             |  |  |
    |           |          |             |  |               +-------^-----+  |  |
    +----------------------+             |  |                       |        |  |
                |                        |  +----------^---------------------+  |
                |                        |             |            |           |
                |                        +--------------------------------------+
                |                                      |            |
                |                              polling |            | DMA
                |                                      |            |
    +-----------v----Shared hugepage memory------------+------------------------+
    |                                                               |           |
    |  +----------------------------------+-------------------------v--------+  |
    |  |            Virtqueues            |              Buffers             |  |
    |  +----------------------------------+----------------------------------+  |
    |                                                                           |
    +---------------------------------------------------------------------------+

# Prerequisites {#vhost_prereqs}

The guest OS must contain virtio-scsi or virtio-blk drivers.  Most Linux and FreeBSD
distributions include virtio drivers.
[Windows virtio drivers](https://fedoraproject.org/wiki/Windows_Virtio_Drivers) must be
installed separately.  The SPDK vhost target has been tested with recent versions of Ubuntu,
Fedora, and Windows.

## QEMU

Userspace vhost-scsi target support was added to upstream QEMU in v2.10.0.  Run
the following command to confirm your QEMU supports userspace vhost-scsi.

~~~{.sh}
qemu-system-x86_64 -device vhost-user-scsi-pci,help
~~~
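
When scripting VM bring-up, the same check can be wrapped in a small helper.
This is only a sketch; `supports_device` is a hypothetical name, not an SPDK
or QEMU utility.

~~~{.sh}
# Returns 0 when the given QEMU binary recognizes a device type; QEMU
# prints the device's property list and exits successfully if it exists.
supports_device() {
    qemu_bin=$1
    device=$2
    "$qemu_bin" -device "$device,help" >/dev/null 2>&1
}

# Usage: supports_device qemu-system-x86_64 vhost-user-scsi-pci && echo OK
~~~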

Userspace vhost-blk target support is not yet upstream in QEMU, but patches
are available in SPDK's QEMU repository:

~~~{.sh}
cd build
make
~~~

Run the following command to confirm your QEMU supports userspace vhost-blk.

~~~{.sh}
qemu-system-x86_64 -device vhost-user-blk-pci,help
~~~

# Starting SPDK vhost target {#vhost_start}

First, run the SPDK setup.sh script to set up hugepages for the SPDK vhost target
application.  This will allocate 4096MiB (4GiB) of hugepages, enough for the SPDK
vhost target and the virtual machine.

~~~{.sh}
HUGEMEM=4096 scripts/setup.sh
~~~
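
The `HUGEMEM` value translates into hugepage counts as follows.  This is plain
shell arithmetic for illustration, assuming the default 2048 kB hugepage size
on x86; it is not part of setup.sh.

~~~{.sh}
# HUGEMEM is expressed in MiB; with 2 MiB hugepages (the x86 default),
# setup.sh ends up reserving HUGEMEM / 2 pages.
HUGEMEM_MB=4096
HUGEPAGE_KB=2048
NR_HUGEPAGES=$(( HUGEMEM_MB * 1024 / HUGEPAGE_KB ))
echo "$NR_HUGEPAGES" # 2048 pages x 2 MiB = 4 GiB
~~~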

Next, start the SPDK vhost target application.  The following command will start vhost
on CPU cores 0 and 1 (cpumask 0x3) with all future socket files placed in /var/tmp.
Vhost will fully occupy the given CPU cores for I/O polling.  Individual vhost devices
can be restricted to run on a subset of these CPU cores.  See @ref vhost_vdev_create
for details.

~~~{.sh}
app/vhost/vhost -S /var/tmp -m 0x3
~~~

To list all available vhost options use the following command.

~~~{.sh}
app/vhost/vhost -h
~~~
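
The `-m` option is a bitmask of CPU core IDs: bit N selects core N.  As a
sketch (plain shell arithmetic, unrelated to SPDK itself), the 0x3 mask used
above can be derived from the core list like this:

~~~{.sh}
# Build a cpumask from a list of core IDs; setting bit N selects core N.
mask=0
for core in 0 1; do
    mask=$(( mask | (1 << core) ))
done
printf '0x%x\n' "$mask" # cores 0 and 1 -> 0x3
~~~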

# SPDK Configuration {#vhost_config}

## Create bdev (block device) {#vhost_bdev_create}

SPDK bdevs are block devices which will be exposed to the guest OS.
For vhost-scsi, bdevs are exposed as SCSI LUNs on SCSI devices attached to the
vhost-scsi controller in the guest OS.
For vhost-blk, bdevs are exposed directly as block devices in the guest OS and are
not associated at all with SCSI.

SPDK supports several different types of storage backends, including NVMe,
Linux AIO, malloc ramdisk and Ceph RBD.  Refer to @ref bdev_getting_started for
additional information on configuring SPDK storage backends.

This guide will use a malloc bdev (ramdisk) named Malloc0. The following RPC
will create a 64MB malloc bdev with 512-byte block size.

~~~{.sh}
scripts/rpc.py construct_malloc_bdev 64 512 -b Malloc0
~~~
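
For reference, the two size arguments determine the bdev's block count.  A
quick arithmetic sketch (not an SPDK command):

~~~{.sh}
# A 64 MiB bdev split into 512-byte blocks.
SIZE_MB=64
BLOCK_SIZE=512
NUM_BLOCKS=$(( SIZE_MB * 1024 * 1024 / BLOCK_SIZE ))
echo "$NUM_BLOCKS" # 131072 blocks
~~~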

## Create a virtio device {#vhost_vdev_create}

### Vhost-SCSI

The following RPC will create a vhost-scsi controller which can be accessed
by QEMU via /var/tmp/vhost.0.  At creation time the controller is bound to a
single CPU core, chosen as the core currently serving the fewest vhost
controllers.  The optional `--cpumask` parameter restricts which cores are
considered - in this case always CPU 0.  To achieve optimal performance
on NUMA systems, the cpumask should specify cores on the same CPU socket as its
associated VM.

~~~{.sh}
scripts/rpc.py construct_vhost_scsi_controller --cpumask 0x1 vhost.0
~~~
The following RPC will attach the Malloc0 bdev to the vhost.0 vhost-scsi
controller.  Malloc0 will appear as a single LUN on a SCSI device with
target ID 0.  SPDK vhost-scsi currently supports only one LUN per SCSI target.
Additional LUNs can be added by specifying a different target ID.

~~~{.sh}
scripts/rpc.py add_vhost_scsi_lun vhost.0 0 Malloc0
~~~
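
Because each SCSI target holds a single LUN, attaching several bdevs means
using a new target ID for each one.  A minimal sketch - the `echo` makes this
a dry run that only prints the RPCs, and the Malloc1/Malloc2 bdev names are
assumed to have been created beforehand:

~~~{.sh}
# Dry run: print one add_vhost_scsi_lun RPC per bdev, each on its own
# SCSI target ID.  Drop the echo to actually issue the RPCs.
target=0
for bdev in Malloc0 Malloc1 Malloc2; do
    echo scripts/rpc.py add_vhost_scsi_lun vhost.0 "$target" "$bdev"
    target=$(( target + 1 ))
done
~~~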

To remove a bdev from a vhost-scsi controller use the following RPC:

~~~{.sh}
scripts/rpc.py remove_vhost_scsi_dev vhost.0 0
~~~

### Vhost-BLK

The following RPC will create a vhost-blk device exposing the Malloc0 bdev.
The device will be accessible to QEMU via /var/tmp/vhost.1.  All the I/O polling
will be pinned to the least occupied CPU core within the given cpumask - in this
case always CPU 0.  For NUMA systems, the cpumask should specify cores on the
same CPU socket as its associated VM.

~~~{.sh}
scripts/rpc.py construct_vhost_blk_controller --cpumask 0x1 vhost.1 Malloc0
~~~
It is also possible to construct a read-only vhost-blk device by specifying an
extra `-r` or `--readonly` parameter.

~~~{.sh}
scripts/rpc.py construct_vhost_blk_controller --cpumask 0x1 -r vhost.1 Malloc0
~~~

## QEMU {#vhost_qemu_config}

Now the virtual machine can be started with QEMU.  The following command-line
parameters must be added to connect the virtual machine to its vhost controller.

First, specify the memory backend for the virtual machine.  Since QEMU must
share the virtual machine's memory with the SPDK vhost target, the memory
must be specified in this format with `share=on`.

~~~{.sh}
-object memory-backend-file,id=mem,size=1G,mem-path=/dev/hugepages,share=on
~~~

Second, ensure QEMU boots from the virtual machine image and not the
SPDK malloc block device by specifying bootindex=0 for the boot image.

~~~{.sh}
-drive file=guest_os_image.qcow2,if=none,id=disk
-device ide-hd,drive=disk,bootindex=0
~~~

Finally, specify the SPDK vhost devices:

### Vhost-SCSI

~~~{.sh}
-chardev socket,id=char0,path=/var/tmp/vhost.0
-device vhost-user-scsi-pci,id=scsi0,chardev=char0
~~~

### Vhost-BLK

~~~{.sh}
-chardev socket,id=char1,path=/var/tmp/vhost.1
-device vhost-user-blk-pci,id=blk0,chardev=char1,logical_block_size=512,size=64M
~~~

## Example output {#vhost_example}

This example uses an NVMe bdev alongside two malloc bdevs.  The SPDK vhost
application is started on CPU cores 0 and 1, and QEMU on cores 2 and 3.

~~~{.sh}
host:~# HUGEMEM=2048 ./scripts/setup.sh
0000:01:00.0 (8086 0953): nvme -> vfio-pci
~~~

~~~{.sh}
host:~# ./app/vhost/vhost -S /var/tmp -s 1024 -m 0x3 &
Starting DPDK 17.11.0 initialization...
[ DPDK EAL parameters: vhost -c 3 -m 1024 --master-lcore=1 --file-prefix=spdk_pid156014 ]
EAL: Detected 48 lcore(s)
EAL: Probing VFIO support...
EAL: VFIO support initialized
app.c: 369:spdk_app_start: *NOTICE*: Total cores available: 2
reactor.c: 668:spdk_reactors_init: *NOTICE*: Occupied cpu socket mask is 0x1
reactor.c: 424:_spdk_reactor_run: *NOTICE*: Reactor started on core 1 on socket 0
reactor.c: 424:_spdk_reactor_run: *NOTICE*: Reactor started on core 0 on socket 0
~~~

~~~{.sh}
host:~# ./scripts/rpc.py construct_nvme_bdev -b Nvme0 -t pcie -a 0000:01:00.0
EAL: PCI device 0000:01:00.0 on NUMA socket 0
EAL:   probe driver: 8086:953 spdk_nvme
EAL:   using IOMMU type 1 (Type 1)
~~~

~~~{.sh}
host:~# ./scripts/rpc.py construct_malloc_bdev 128 4096 Malloc0
Malloc0
~~~

~~~{.sh}
host:~# ./scripts/rpc.py construct_vhost_scsi_controller --cpumask 0x1 vhost.0
VHOST_CONFIG: vhost-user server: socket created, fd: 21
VHOST_CONFIG: bind to /var/tmp/vhost.0
vhost.c: 596:spdk_vhost_dev_construct: *NOTICE*: Controller vhost.0: new controller added
~~~

~~~{.sh}
host:~# ./scripts/rpc.py add_vhost_scsi_lun vhost.0 0 Nvme0n1
vhost_scsi.c: 840:spdk_vhost_scsi_dev_add_tgt: *NOTICE*: Controller vhost.0: defined target 'Target 0' using lun 'Nvme0'
~~~

~~~{.sh}
host:~# ./scripts/rpc.py add_vhost_scsi_lun vhost.0 1 Malloc0
vhost_scsi.c: 840:spdk_vhost_scsi_dev_add_tgt: *NOTICE*: Controller vhost.0: defined target 'Target 1' using lun 'Malloc0'
~~~

~~~{.sh}
host:~# ./scripts/rpc.py construct_malloc_bdev 64 512 -b Malloc1
Malloc1
~~~

~~~{.sh}
host:~# ./scripts/rpc.py construct_vhost_blk_controller --cpumask 0x2 vhost.1 Malloc1
vhost_blk.c: 719:spdk_vhost_blk_construct: *NOTICE*: Controller vhost.1: using bdev 'Malloc1'
~~~

~~~{.sh}
host:~# taskset -c 2,3 qemu-system-x86_64 \
  --enable-kvm \
  -cpu host -smp 2 \
  -m 1G -object memory-backend-file,id=mem0,size=1G,mem-path=/dev/hugepages,share=on -numa node,memdev=mem0 \
  -drive file=guest_os_image.qcow2,if=none,id=disk \
  -device ide-hd,drive=disk,bootindex=0 \
  -chardev socket,id=spdk_vhost_scsi0,path=/var/tmp/vhost.0 \
  -device vhost-user-scsi-pci,id=scsi0,chardev=spdk_vhost_scsi0,num_queues=4 \
  -chardev socket,id=spdk_vhost_blk0,path=/var/tmp/vhost.1 \
  -device vhost-user-blk-pci,logical_block_size=512,size=64M,chardev=spdk_vhost_blk0,num_queues=4
~~~

Please note the following two commands are run on the guest VM.

~~~{.sh}
guest:~# lsblk --output "NAME,KNAME,MODEL,HCTL,SIZE,VENDOR,SUBSYSTEMS"
NAME   KNAME MODEL            HCTL         SIZE VENDOR   SUBSYSTEMS
sda    sda   QEMU HARDDISK    1:0:0:0       80G ATA      block:scsi:pci
  sda1 sda1                                 80G          block:scsi:pci
sdb    sdb   NVMe disk        2:0:0:0    372,6G INTEL    block:scsi:virtio:pci
sdc    sdc   Malloc disk      2:0:1:0      128M INTEL    block:scsi:virtio:pci
vda    vda                                 128M 0x1af4   block:virtio:pci
~~~

~~~{.sh}
guest:~# poweroff
~~~

~~~{.sh}
host:~# fg
<< CTRL + C >>
vhost.c:1006:session_shutdown: *NOTICE*: Exiting
~~~

We can see that `sdb` and `sdc` are SPDK vhost-scsi LUNs, and `vda` is an SPDK
vhost-blk disk.


# Advanced Topics {#vhost_advanced_topics}

## Multi-Queue Block Layer (blk-mq) {#vhost_multiqueue}

For best performance use the Linux kernel block multi-queue feature with vhost.
To enable it on Linux, it is required to modify kernel options inside the
virtual machine.  It may also be necessary to set `num_queues=4` to saturate the
physical device.  Adding too many queues might cause vhost performance
degradation if many vhost devices are used, because each device will require
additional `num_queues` to be polled.
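
On older guest kernels that do not enable blk-mq for SCSI by default, it is
switched on with the standard Linux kernel parameter `scsi_mod.use_blk_mq=1`.
A sketch of appending it to a kernel command-line string (the example contents
of `CMDLINE` are made up for illustration):

~~~{.sh}
# Append the blk-mq switch to an example kernel command line, e.g. before
# writing it back into the bootloader configuration.
CMDLINE="quiet splash"
CMDLINE="$CMDLINE scsi_mod.use_blk_mq=1"
echo "$CMDLINE"
~~~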

## Hot-attach/hot-detach {#vhost_hotattach}

Hotplug/hotremove within a vhost controller is called hot-attach/detach. This is to
distinguish it from SPDK bdev hotplug/hotremove. E.g. if an NVMe bdev is attached
to a vhost-scsi controller, physically hotremoving the NVMe will trigger vhost-scsi
hot-detach. It is also possible to hot-detach a bdev manually via RPC - for example
when the bdev is about to be attached to another controller. See the details below.

Please also note that hot-attach/detach is Vhost-SCSI-specific.  There are no RPCs
to hot-attach/detach a bdev from a Vhost-BLK device.  If a Vhost-BLK device exposes
an NVMe bdev that is hotremoved, all the I/O traffic on that Vhost-BLK device will
be aborted - possibly flooding a VM with syslog warnings and errors.

### Hot-attach

Hot-attach is done by simply attaching a bdev to a vhost controller with a QEMU VM
already started.  No additional action is necessary.

~~~{.sh}
scripts/rpc.py add_vhost_scsi_lun vhost.0 0 Malloc0
~~~
### Hot-detach

Just like hot-attach, hot-detach is done by simply removing the bdev from a
controller while the QEMU VM is already started.


~~~{.sh}
scripts/rpc.py remove_vhost_scsi_dev vhost.0 0
~~~

Removing an entire bdev will hot-detach it from a controller as well.

~~~{.sh}
scripts/rpc.py delete_bdev Malloc0
~~~


# Known bugs and limitations {#vhost_bugs}

## Windows virtio-blk driver before version 0.1.130-1 only works with 512-byte sectors