Commit 4985289a authored by Alexey Marchuk, committed by Tomasz Zawadzki

nvme/rdma: Lock mutex when destroying lingering qpair



Handling of a lingering qpair's destruction differs from the
regular destroy/disconnect path: the controller's mutex is
not locked in that case. That can lead to a race condition
where nvme_rdma_qpair_destroy iterates the controller's
outstanding rdma_cm events and acks those belonging to the
qpair while another thread polls rdma_cm events and reaps an
event for the qpair being destroyed. In that case we attempt
to destroy an rdma_cm id that still has unprocessed events,
and rdma_destroy_id gets stuck.

Fixes issue #3347

Signed-off-by: Alexey Marchuk <alexeymar@nvidia.com>
Change-Id: I3470c6080e2c19a63eb65eecc398dccd92327eb9
Reviewed-on: https://review.spdk.io/gerrit/c/spdk/spdk/+/23324


Reviewed-by: Shuhei Matsumoto <smatsumoto@nvidia.com>
Reviewed-by: Ben Walker <ben@nvidia.com>
Tested-by: SPDK CI Jenkins <sys_sgci@intel.com>
Community-CI: Mellanox Build Bot
parent 956fd5e1
+9 −1
@@ -1963,6 +1963,9 @@ quiet:
static int
nvme_rdma_qpair_wait_until_quiet(struct nvme_rdma_qpair *rqpair)
{
	struct spdk_nvme_qpair *qpair = &rqpair->qpair;
	struct spdk_nvme_ctrlr *ctrlr = qpair->ctrlr;

	if (spdk_get_ticks() < rqpair->evt_timeout_ticks &&
	    (rqpair->current_num_sends != 0 ||
	     (!rqpair->srq && rqpair->rsps->current_num_recvs != 0))) {
@@ -1970,9 +1973,14 @@ nvme_rdma_qpair_wait_until_quiet(struct nvme_rdma_qpair *rqpair)
	}

	rqpair->state = NVME_RDMA_QPAIR_STATE_EXITED;

	nvme_rdma_qpair_abort_reqs(&rqpair->qpair, 0);
	if (!nvme_qpair_is_admin_queue(qpair)) {
		nvme_robust_mutex_lock(&ctrlr->ctrlr_lock);
	}
	nvme_rdma_qpair_destroy(rqpair);
	if (!nvme_qpair_is_admin_queue(qpair)) {
		nvme_robust_mutex_unlock(&ctrlr->ctrlr_lock);
	}
	nvme_transport_ctrlr_disconnect_qpair_done(&rqpair->qpair);

	return 0;