spdk_nvme_ctrlr_reconnect_async() acquires the ctrlr_lock and expects a
later call to spdk_nvme_ctrlr_reconnect_poll_async() (one that does not
return -EAGAIN) to release it. But in the ctrlr_loss_timeout case, we stop
calling spdk_nvme_ctrlr_reconnect_poll_async(), never giving it a chance
to fail, which would release the lock. Later we detach the controller,
which destroys the ctrlr_lock mutex while it is still held.

This can cause various forms of corruption later, when the memory for
that mutex is reallocated as part of some other buffer. Whenever any mutex
is acquired or released on that same pthread, some of that memory is
modified (since each pthread keeps a linked list of the mutexes it
currently holds).

So instead call spdk_nvme_ctrlr_fail() when the ctrlr_loss_timeout
expires. The next call to spdk_nvme_ctrlr_reconnect_poll_async() will then
release the lock and return failure (instead of -EAGAIN), and the detach
process continues.

Fixes issue #3401.

Signed-off-by: Jim Harris <jim.harris@samsung.com>
Change-Id: I7268e7ba40df30f14e12fbfb6439e381ee1e086b
Reviewed-on: https://review.spdk.io/gerrit/c/spdk/spdk/+/23731
Reviewed-by: Konrad Sztyber <konrad.sztyber@intel.com>
Reviewed-by: Shuhei Matsumoto <smatsumoto@nvidia.com>
Tested-by: SPDK CI Jenkins <sys_sgci@intel.com>
Community-CI: Mellanox Build Bot
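
The lock hand-off described in this message can be illustrated with a small
stand-alone model. This is only a sketch under stated assumptions, not the
SPDK implementation: the mock_* names, the struct fields, and the -ENXIO
return value are all hypothetical, and a plain bool stands in for the
ctrlr_lock mutex. It only shows why stopping the poll leaves the lock held
at detach, and why marking the controller failed first lets one more poll
release it.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for a controller; not the real SPDK struct. */
struct mock_ctrlr {
	bool lock_held;  /* models ctrlr_lock being held across calls */
	bool is_failed;  /* set by mock_ctrlr_fail() */
	bool connected;  /* never becomes true in this sketch */
};

/* Models the "async" call: takes the lock and returns with it held. */
static void
mock_reconnect_async(struct mock_ctrlr *ctrlr)
{
	assert(!ctrlr->lock_held);
	ctrlr->lock_held = true;
}

/* Models marking the controller as failed. */
static void
mock_ctrlr_fail(struct mock_ctrlr *ctrlr)
{
	ctrlr->is_failed = true;
}

/* Models the poll call: only a result other than -EAGAIN drops the lock. */
static int
mock_reconnect_poll_async(struct mock_ctrlr *ctrlr)
{
	if (!ctrlr->is_failed && !ctrlr->connected) {
		return -EAGAIN; /* still reconnecting, lock stays held */
	}
	ctrlr->lock_held = false;
	/* -ENXIO is an assumed error code for this sketch. */
	return ctrlr->is_failed ? -ENXIO : 0;
}

/* Returns true if the lock is still held at the point detach would run. */
static bool
lock_held_at_detach(bool fail_on_timeout)
{
	struct mock_ctrlr ctrlr = { false, false, false };
	int rc;

	mock_reconnect_async(&ctrlr);
	rc = mock_reconnect_poll_async(&ctrlr);
	assert(rc == -EAGAIN);

	/* The loss timeout expires here. */
	if (fail_on_timeout) {
		/* New behavior: fail the controller, then poll once more. */
		mock_ctrlr_fail(&ctrlr);
		rc = mock_reconnect_poll_async(&ctrlr);
		assert(rc != -EAGAIN);
	}
	/* Old behavior: polling simply stops, so nothing releases the lock. */

	return ctrlr.lock_held;
}
```

In this model, lock_held_at_detach(false) corresponds to the behavior before
the change (the lock is still held when the mutex would be destroyed), and
lock_held_at_detach(true) corresponds to the behavior after it.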