Commit 347190f1 authored by Jacek Kalwas, committed by Tomasz Zawadzki

nvme: fix qpair's fabric_poll_status leak



In the situation where a fabric command (either connect or auth) is
outstanding and the qpair is disconnected, the associated
fabric_poll_status can be leaked since no entity is polling the qpair
to invoke nvme_tcp_ctrlr_connect_qpair_poll and release it.

This situation occurs in the test where the target adds one or more namespaces
and there are at least two paths. On the initiator side, the first
controller is notified about the namespace via an AER and the nbdev
channel is opened for examine purposes. While the examine is happening,
the second controller is also notified about the namespace and opens
an I/O channel (adds io path to existing nbdev channel). The first
controller is able to finalize the TCP connect, fabric connect, and
examine steps, whereas the second controller is still in progress. When
the examine completes, the nbdev channel is released and the
controllers’ channels are also released (deleting the paths causes
disconnects), while the second controller has not yet completed
the fabric connect.

Most likely, other scenarios exist where this leak can occur.

This fix aims for a generic solution, hence the similar handling of the
possible leak is moved from nvme_tcp_qpair_process_completions to
nvme_tcp_ctrlr_disconnect_qpair.

There is also a dma_data leak in this situation, but unfortunately such
spdk/dpdk allocations are not reported by ASAN.

Change-Id: I87f66efc9d7ec3a49e9bae18f9d4fbb378975d60
Signed-off-by: Jacek Kalwas <jacek.kalwas@nutanix.com>
Reviewed-on: https://review.spdk.io/c/spdk/spdk/+/26479


Reviewed-by: Jim Harris <jim.harris@nvidia.com>
Community-CI: Mellanox Build Bot
Tested-by: SPDK Automated Test System <spdkbot@gmail.com>
Reviewed-by: Tomasz Zawadzki <tomasz@tzawadzki.com>
parent a4521c71
+13 −9
@@ -340,6 +340,8 @@ nvme_tcp_qpair_set_recv_state(struct nvme_tcp_qpair *tqpair,
 }
 
 static void nvme_tcp_qpair_abort_reqs(struct spdk_nvme_qpair *qpair, uint32_t dnr);
+static int nvme_tcp_ctrlr_connect_qpair_poll(struct spdk_nvme_ctrlr *ctrlr,
+		struct spdk_nvme_qpair *qpair);
 
 static void
 nvme_tcp_ctrlr_disconnect_qpair(struct spdk_nvme_ctrlr *ctrlr, struct spdk_nvme_qpair *qpair)
@@ -384,6 +386,17 @@ nvme_tcp_ctrlr_disconnect_qpair(struct spdk_nvme_ctrlr *ctrlr, struct spdk_nvme_
 		assert(TAILQ_EMPTY(&tqpair->outstanding_reqs));
 		nvme_transport_ctrlr_disconnect_qpair_done(qpair);
 	}
+
+	/* A non-NULL fabric poll status indicates that a fabric command was outstanding
+	 * and the qpair state was CONNECTING before the disconnect was invoked. That
+	 * command was aborted by the socket close. To avoid leaking the status and its
+	 * dma_data, nvme_tcp_ctrlr_connect_qpair_poll is invoked to release them. */
+	if (qpair->fabric_poll_status != NULL) {
+		assert(qpair->fabric_poll_status->done);
+		rc = nvme_tcp_ctrlr_connect_qpair_poll(qpair->ctrlr, qpair);
+		assert(rc != -EAGAIN);
+		assert(!qpair->fabric_poll_status);
+	}
 }
 
 static int
@@ -2088,9 +2101,6 @@ nvme_tcp_qpair_check_timeout(struct spdk_nvme_qpair *qpair)
 	}
 }
 
-static int nvme_tcp_ctrlr_connect_qpair_poll(struct spdk_nvme_ctrlr *ctrlr,
-		struct spdk_nvme_qpair *qpair);
-
 static int
 nvme_tcp_qpair_process_completions(struct spdk_nvme_qpair *qpair, uint32_t max_completions)
 {
@@ -2149,12 +2159,6 @@ nvme_tcp_qpair_process_completions(struct spdk_nvme_qpair *qpair, uint32_t max_c
 fail:
 	qpair->transport_failure_reason = SPDK_NVME_QPAIR_FAILURE_UNKNOWN;
 	nvme_ctrlr_disconnect_qpair(qpair);
-
-	/* Needed to free the poll_status */
-	if (qpair->fabric_poll_status != NULL) {
-		nvme_tcp_ctrlr_connect_qpair_poll(qpair->ctrlr, qpair);
-	}
-
 	return -ENXIO;
}