Commit ed0b611f authored by Evgeniy Kochetov, committed by Jim Harris

nvmf/rdma: Add shared receive queue support



This is a new feature for the NVMe-oF RDMA target, intended to reduce
resource consumption (by sharing receive resources) and to exploit
locality (of completions and memory) for the best performance with
Shared Receive Queues (SRQs). We create one SRQ per core (poll
group) per device and associate each created QP/CQ with the
appropriate SRQ.
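
For context, this is roughly the verbs pattern the feature builds on.
It is a minimal, hand-written sketch rather than SPDK code; error
handling is elided and the attribute values are illustrative:

#include <infiniband/verbs.h>

/* One SRQ per poll group/device pair; depth comes from MaxSRQDepth. */
static struct ibv_srq *
create_shared_rq(struct ibv_pd *pd, uint32_t srq_depth)
{
	struct ibv_srq_init_attr srq_attr = {
		.attr = {
			.max_wr  = srq_depth,
			.max_sge = 1,
		},
	};

	return ibv_create_srq(pd, &srq_attr);
}

/* Every QP accepted on that device consumes receives from the SRQ. */
static struct ibv_qp *
create_qp_on_srq(struct ibv_pd *pd, struct ibv_cq *cq, struct ibv_srq *srq)
{
	struct ibv_qp_init_attr qp_attr = {
		.qp_type = IBV_QPT_RC,
		.send_cq = cq,
		.recv_cq = cq,
		.srq     = srq, /* max_recv_wr is ignored when an SRQ is attached */
		.cap     = { .max_send_wr = 128, .max_send_sge = 1 },
	};

	return ibv_create_qp(pd, &qp_attr);
}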

Our testing environment has 2 hosts.
Host 1:
  CPU: Intel(R) Xeon(R) CPU E5-2609 0 @ 2.40GHz, dual-socket (8 cores total)
  Network: ConnectX-5, ConnectX-5 VPI, 100GbE, single-port QSFP28, PCIe 3.0 x16
  Disk: Intel Optane SSD 900P Series
  OS: Fedora 27 x86_64
Host 2:
  CPU: Intel(R) Xeon(R) CPU E5-2630 v2 @ 2.60GHz, dual-socket (24 cores total)
  Network: ConnectX-4 VPI, 100GbE, dual-port QSFP28
  Disk: Intel Optane SSD 900P Series
  OS: CentOS 7.5.1804 x86_64
Hosts are connected via Spectrum switch.
Host 1 is running SPDK NVMeoF target.
Host 2 is used as initiator running fio with SPDK plugin.

Configuration:
- SPDK NVMeoF target: cpu mask 0x0F (4 cores), max queue depth 128,
  max SRQ depth 1024, max QPs per controller 1024
- Single NVMf subsystem with single namespace backed by physical SSD disk
- fio with SPDK plugin: randread pattern, 1-256 jobs, block size 4k,
  IO depth 16, cpu_mask 0xFFF0, IO rate 10k, rate process "poisson"
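
In code, these limits map onto `struct spdk_nvmf_transport_opts`, which
this patch extends with `max_srq_depth` (see the header diff below). A
hedged sketch of creating the transport with the values above — only the
fields relevant here are set, and a real application would populate the
remaining options (shared buffers, I/O sizes, etc.) as well:

#include "spdk/nvmf.h"

static struct spdk_nvmf_transport *
create_rdma_transport(void)
{
	struct spdk_nvmf_transport_opts opts = {
		.max_queue_depth      = 128,   /* per-QP depth from the test config */
		.max_srq_depth        = 1024,  /* new option added by this patch */
		.in_capsule_data_size = 4096,  /* matches the calculation below */
	};

	/* Same call the conf-parsing diff below ends with. */
	return spdk_nvmf_transport_create(SPDK_NVME_TRANSPORT_RDMA, &opts);
}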

Here is a full fio command line:
fio  --name=Job --stats=1 --group_reporting=1 --idle-prof=percpu \
--loops=1 --numjobs=1 --thread=1 --time_based=1 --runtime=30s \
--ramp_time=5s --bs=4k --size=4G --iodepth=16 --readwrite=randread \
--rwmixread=75 --randrepeat=1 --ioengine=spdk --direct=1 \
--gtod_reduce=0 --cpumask=0xFFF0 --rate_iops=10k \
--rate_process=poisson \
--filename='trtype=RDMA adrfam=IPv4 traddr=1.1.79.1 trsvcid=4420 ns=1'

SPDK allocates the following entities for every work request in a
receive queue (shared or not): reqs (1024 bytes), recvs (96 bytes),
cmds (64 bytes), cpls (16 bytes), and an in-capsule data buffer. All
except the last are fixed size, totaling 1200 bytes per work request;
the in-capsule data size is configured to 4096 bytes.
Memory consumption calculation (target):
- Multiple SRQ: core_num * ib_devs_num * SRQ_depth * (1200 + in_capsule_data_size)
- Multiple RQ: queue_num * RQ_depth * (1200 + in_capsule_data_size)
Admin queues are ignored in these calculations for simplicity; the
cases below apply the formulas, followed by a short C check.

Cases:
1. Multiple SRQ with 1024 entries:
   - Mem = 4 * 1 * 1024 * (1200 + 4096) = 20.7 MiB
     (constant: does not depend on the number of initiators)
2. RQ with 128 entries for 64 initiators:
   - Mem = 64 * 128 * (1200 + 4096) = 41.4 MiB
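
A few lines of C reproduce both estimates (values taken from the cases
above):

#include <stdio.h>

int main(void)
{
	/* 1200 = reqs (1024) + recvs (96) + cmds (64) + cpls (16) */
	const double per_wr = 1200.0 + 4096.0; /* + in-capsule data buffer */

	double srq = 4 * 1 * 1024 * per_wr;  /* cores * devices * SRQ depth */
	double rq  = 64 * 128 * per_wr;      /* initiators * RQ depth */

	printf("SRQ: %.1f MiB\n", srq / (1024 * 1024)); /* 20.7 MiB */
	printf("RQ:  %.1f MiB\n", rq / (1024 * 1024));  /* 41.4 MiB */
	return 0;
}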

Results:
FIO_JOBS   kIOPS     Bandwidth,MiB/s  AvgLatency,us  MaxResidentSize,kiB
       RQ       SRQ     RQ      SRQ    RQ       SRQ      RQ       SRQ
1      8.623    8.623   33.7    33.7   13.89    14.03    144376   155624
2      17.3     17.3    67.4    67.4   14.03    14.1     145776   155700
4      34.5     34.5    135     135    14.15    14.23    146540   156184
8      69.1     69.1    270     270    14.64    14.49    148116   156960
16     138      138     540     540    14.84    15.38    151216   158668
32     276      276     1079    1079   16.5     16.61    157560   161936
64     513      502     2005    1960   1673     1612     170408   168440
128    535      526     2092    2054   3329     3344     195796   181524
256    571      571     2232    2233   6854     6873     246484   207856

Performance is on par in all cases, and the benefit shows up in memory
consumption: from 64 jobs onward the SRQ configuration has the smaller
maximum resident size (207856 kiB vs. 246484 kiB at 256 jobs), while at
low job counts the fixed SRQ allocation costs slightly more.

Change-Id: I40c70f6ccbad7754918bcc6cb397e955b09d1033
Signed-off-by: Evgeniy Kochetov <evgeniik@mellanox.com>
Signed-off-by: Sasha Kotchubievsky <sashakot@mellanox.com>
Reviewed-on: https://review.gerrithub.io/c/spdk/spdk/+/428458


Tested-by: SPDK CI Jenkins <sys_sgci@intel.com>
Reviewed-by: Jim Harris <james.r.harris@intel.com>
Reviewed-by: Ben Walker <benjamin.walker@intel.com>
parent f186434e
+7 −0
@@ -35,6 +35,13 @@ to be performed on the thread at given time.
A new API `spdk_bdev_get_data_block_size` has been added to get the
size of the data block excluding metadata.

### NVMe-oF Target

Support for per-device shared receive queues in the RDMA transport has been added.
The size of a shared receive queue is defined by the transport configuration file
parameter `MaxSRQDepth` and the `nvmf_create_transport` RPC method parameter
`max_srq_depth`. The default size is 4096.

## v19.01:

### ocf bdev
+3 −0
@@ -99,6 +99,9 @@
  # Set the number of shared buffers to be cached per poll group
  #BufCacheSize 32

  # Set the maximum number of outstanding I/O per shared receive queue. Relevant only for RDMA transport
  #MaxSRQDepth 4096

[Transport]
  # Set TCP transport type.
  Type TCP
+3 −2
/*-
 *   BSD LICENSE
 *
 *   Copyright (c) Intel Corporation.
 *   All rights reserved.
 *   Copyright (c) Intel Corporation. All rights reserved.
 *   Copyright (c) 2018 Mellanox Technologies LTD. All rights reserved.
 *
 *   Redistribution and use in source and binary forms, with or without
 *   modification, are permitted provided that the following conditions
@@ -72,6 +72,7 @@ struct spdk_nvmf_transport_opts {
	uint32_t max_aq_depth;
	uint32_t num_shared_buffers;
	uint32_t buf_cache_size;
	uint32_t max_srq_depth;
};

/**
+13 −2
/*-
 *   BSD LICENSE
 *
 *   Copyright (c) Intel Corporation.
 *   All rights reserved.
 *   Copyright (c) Intel Corporation. All rights reserved.
 *   Copyright (c) 2018 Mellanox Technologies LTD. All rights reserved.
 *
 *   Redistribution and use in source and binary forms, with or without
 *   modification, are permitted provided that the following conditions
@@ -529,6 +529,17 @@ spdk_nvmf_parse_transport(struct spdk_nvmf_parse_transport_ctx *ctx)
		opts.buf_cache_size = val;
	}

	val = spdk_conf_section_get_intval(ctx->sp, "MaxSRQDepth");
	if (val >= 0) {
		if (trtype == SPDK_NVME_TRANSPORT_RDMA) {
			opts.max_srq_depth = val;
		} else {
			SPDK_ERRLOG("MaxSRQDepth is relevant only for RDMA transport '%s'\n", type);
			ctx->cb_fn(-1);
			free(ctx);
			return;
		}
	}

	transport = spdk_nvmf_transport_create(trtype, &opts);
	if (transport) {
+6 −2
/*-
 *   BSD LICENSE
 *
 *   Copyright (c) Intel Corporation.
 *   All rights reserved.
 *   Copyright (c) Intel Corporation. All rights reserved.
 *   Copyright (c) 2018 Mellanox Technologies LTD. All rights reserved.
 *
 *   Redistribution and use in source and binary forms, with or without
 *   modification, are permitted provided that the following conditions
@@ -1435,6 +1435,10 @@ static const struct spdk_json_object_decoder nvmf_rpc_create_transport_decoder[]
		"buf_cache_size", offsetof(struct nvmf_rpc_create_transport_ctx, opts.buf_cache_size),
		spdk_json_decode_uint32, true
	},
	{
		"max_srq_depth", offsetof(struct nvmf_rpc_create_transport_ctx, opts.max_srq_depth),
		spdk_json_decode_uint32, true
	},
};

static void