spdk_nvme_ctrlr_reconnect_async() acquires the ctrlr_lock and expects a later call to spdk_nvme_ctrlr_reconnect_poll_async() to release it (one that does not return -EAGAIN). But in the ctrlr_loss_timeout case, we stop calling spdk_nvme_ctrlr_reconnect_poll_async(), never giving it a chance to fail, which would release the lock. Later we detach the controller, which destroys the ctrlr_lock mutex while it is still held.

This can cause various forms of corruption later, when the memory for that mutex is allocated again as part of some other buffer. Whenever any mutex is acquired or released on that same pthread, it will modify some of that memory (since each pthread keeps a linked list of the mutexes it currently holds).

So instead call spdk_nvme_ctrlr_fail() when the ctrlr_loss_timeout happens. The next call to spdk_nvme_ctrlr_reconnect_poll_async() will then release the lock and return failure (instead of -EAGAIN), and the detach process can continue.

Fixes issue #3401.

Signed-off-by: Jim Harris <jim.harris@samsung.com>
Reviewed-on: https://review.spdk.io/gerrit/c/spdk/spdk/+/23731 (master)

(cherry picked from commit 802d1c63)
Change-Id: I7268e7ba40df30f14e12fbfb6439e381ee1e086b
Signed-off-by: Marek Chomnicki <marek.chomnicki@intel.com>
Reviewed-on: https://review.spdk.io/gerrit/c/spdk/spdk/+/23876
Reviewed-by: Jim Harris <jim.harris@samsung.com>
Reviewed-by: Tomasz Zawadzki <tomasz.zawadzki@intel.com>
Tested-by: SPDK CI Jenkins <sys_sgci@intel.com>
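
The lock handoff described in this commit message can be illustrated with a small toy model. This is only a sketch under stated assumptions, not the SPDK code: struct toy_ctrlr and every toy_* function are hypothetical stand-ins for the controller, spdk_nvme_ctrlr_reconnect_async(), spdk_nvme_ctrlr_reconnect_poll_async(), spdk_nvme_ctrlr_fail() and the detach path. The lock_held flag, the -ENODEV failure code and the -EBUSY return from toy_detach() exist only so the example can report the bad state without actually destroying a locked mutex.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>

/* Toy stand-in for a controller. NOT the real SPDK structure. */
struct toy_ctrlr {
	pthread_mutex_t lock;
	bool lock_held;        /* tracked only so the example can check the invariant */
	bool is_failed;
	int polls_until_done;  /* number of polls that will still return -EAGAIN */
};

void toy_ctrlr_init(struct toy_ctrlr *c, int polls_until_done)
{
	assert(polls_until_done >= 0);
	pthread_mutex_init(&c->lock, NULL);
	c->lock_held = false;
	c->is_failed = false;
	c->polls_until_done = polls_until_done;
}

/* Start a reconnect: take the lock and leave it held for the poller. */
void toy_reconnect_async(struct toy_ctrlr *c)
{
	pthread_mutex_lock(&c->lock);
	c->lock_held = true;
}

/* Mark the controller failed so that the next poll gives up. */
void toy_ctrlr_fail(struct toy_ctrlr *c)
{
	c->is_failed = true;
}

/*
 * Poll the reconnect. While it returns -EAGAIN the lock stays held.
 * Any other return value (success or failure) releases the lock.
 */
int toy_reconnect_poll_async(struct toy_ctrlr *c)
{
	int rc;

	if (!c->is_failed && c->polls_until_done > 0) {
		c->polls_until_done--;
		return -EAGAIN;
	}

	rc = c->is_failed ? -ENODEV : 0;
	c->lock_held = false;
	pthread_mutex_unlock(&c->lock);
	return rc;
}

/*
 * Detach: destroy the mutex only if nobody holds it. Returning -EBUSY
 * marks the situation this commit fixes (destroying a held mutex).
 */
int toy_detach(struct toy_ctrlr *c)
{
	if (c->lock_held) {
		return -EBUSY;
	}
	pthread_mutex_destroy(&c->lock);
	return 0;
}
```

In this model, stopping the polls after a -EAGAIN and then calling toy_detach() reports -EBUSY, which corresponds to the old timeout path. Calling toy_ctrlr_fail() followed by one more toy_reconnect_poll_async() releases the lock first, so toy_detach() succeeds.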