Commit 3e04ecdd authored Jun 14, 2024 by Jim Harris Committed by Tomasz Zawadzki Jul 01, 2024

bdev_nvme: use spdk_nvme_ctrlr_fail() on ctrlr_loss_timeout



spdk_nvme_ctrlr_reconnect_async() acquires the ctrlr_lock, and
expects some later call to spdk_nvme_ctrlr_reconnect_poll_async()
to release it (one that does not return -EAGAIN).

But in the ctrlr_loss_timeout case, we stop calling
spdk_nvme_ctrlr_reconnect_async() - never giving it a chance to fail
which would release the lock. Later we detach the controller,
which destroys the ctrlr_lock mutex, with the reference still held.
This can cause various forms of corruption later, when the memory
for that mutex is allocated as part of some buffer at a later time.
Whenever any mutex on that same pthread is acquired or released,
it will modify some of that memory (since each pthread keeps a linked
list of the mutexes currently held).

So instead call spdk_nvme_ctrlr_fail() when the ctrlr_loss_timeout
happens. The next call to spdk_nvme_ctrlr_reconnect_poll_async()
will then release the lock and return failure (instead of -EAGAIN),
and continue with the detach process.

Fixes issue #3401.

Signed-off-by: Jim Harris <jim.harris@samsung.com>
Reviewed-on: https://review.spdk.io/gerrit/c/spdk/spdk/+/23731

 (master)

(cherry picked from commit 802d1c63)
Change-Id: I7268e7ba40df30f14e12fbfb6439e381ee1e086b
Signed-off-by: Marek Chomnicki <marek.chomnicki@intel.com>
Reviewed-on: https://review.spdk.io/gerrit/c/spdk/spdk/+/23876


Reviewed-by: Jim Harris <jim.harris@samsung.com>
Reviewed-by: Tomasz Zawadzki <tomasz.zawadzki@intel.com>
Tested-by: SPDK CI Jenkins <sys_sgci@intel.com>

parent b45661dd

module/bdev/nvme/bdev_nvme.c

+11 −5

Original line number	Diff line number	Diff line
		@@ -2262,12 +2262,18 @@ bdev_nvme_reconnect_ctrlr_poll(void *arg)
		struct nvme_ctrlr *nvme_ctrlr = arg;
		int rc = -ETIMEDOUT;

		if (!bdev_nvme_check_ctrlr_loss_timeout(nvme_ctrlr)) {
		if (bdev_nvme_check_ctrlr_loss_timeout(nvme_ctrlr)) {
		/* Mark the ctrlr as failed. The next call to
		* spdk_nvme_ctrlr_reconnect_poll_async() will then
		* do the necessary cleanup and return failure.
		*/
		spdk_nvme_ctrlr_fail(nvme_ctrlr->ctrlr);
		}

		rc = spdk_nvme_ctrlr_reconnect_poll_async(nvme_ctrlr->ctrlr);
		if (rc == -EAGAIN) {
		return SPDK_POLLER_BUSY;
		}
		}

		spdk_poller_unregister(&nvme_ctrlr->reset_detach_poller);
		if (rc == 0) {