Commit 3e04ecdd authored by Jim Harris's avatar Jim Harris Committed by Tomasz Zawadzki
Browse files

bdev_nvme: use spdk_nvme_ctrlr_fail() on ctrlr_loss_timeout



spdk_nvme_ctrlr_reconnect_async() acquires the ctrlr_lock, and
expects some later call to spdk_nvme_ctrlr_reconnect_poll_async()
to release it (one that does not return -EAGAIN).

But in the ctrlr_loss_timeout case, we stop calling
spdk_nvme_ctrlr_reconnect_async() - never giving it a chance to fail
which would release the lock. Later we detach the controller,
which destroys the ctrlr_lock mutex, with the reference still held.
This can cause various forms of corruption later, when the memory
for that mutex is allocated as part of some buffer at a later time.
Whenever any mutex on that same pthread is acquired or released,
it will modify some of that memory (since each pthread keeps a linked
list of the mutexes currently held).

So instead call spdk_nvme_ctrlr_fail() when the ctrlr_loss_timeout
happens. The next call to spdk_nvme_ctrlr_reconnect_poll_async()
will then release the lock and return failure (instead of -EAGAIN),
and continue with the detach process.

Fixes issue #3401.

Signed-off-by: default avatarJim Harris <jim.harris@samsung.com>
Reviewed-on: https://review.spdk.io/gerrit/c/spdk/spdk/+/23731

 (master)

(cherry picked from commit 802d1c63)
Change-Id: I7268e7ba40df30f14e12fbfb6439e381ee1e086b
Signed-off-by: default avatarMarek Chomnicki <marek.chomnicki@intel.com>
Reviewed-on: https://review.spdk.io/gerrit/c/spdk/spdk/+/23876


Reviewed-by: default avatarJim Harris <jim.harris@samsung.com>
Reviewed-by: default avatarTomasz Zawadzki <tomasz.zawadzki@intel.com>
Tested-by: default avatarSPDK CI Jenkins <sys_sgci@intel.com>
parent b45661dd
Loading
Loading
Loading
Loading
+11 −5
Original line number Diff line number Diff line
@@ -2262,12 +2262,18 @@ bdev_nvme_reconnect_ctrlr_poll(void *arg)
	struct nvme_ctrlr *nvme_ctrlr = arg;
	int rc = -ETIMEDOUT;

	if (!bdev_nvme_check_ctrlr_loss_timeout(nvme_ctrlr)) {
	if (bdev_nvme_check_ctrlr_loss_timeout(nvme_ctrlr)) {
		/* Mark the ctrlr as failed. The next call to
		 * spdk_nvme_ctrlr_reconnect_poll_async() will then
		 * do the necessary cleanup and return failure.
		 */
		spdk_nvme_ctrlr_fail(nvme_ctrlr->ctrlr);
	}

	rc = spdk_nvme_ctrlr_reconnect_poll_async(nvme_ctrlr->ctrlr);
	if (rc == -EAGAIN) {
		return SPDK_POLLER_BUSY;
	}
	}

	spdk_poller_unregister(&nvme_ctrlr->reset_detach_poller);
	if (rc == 0) {