There was a complex issue where the failover was lost and the I/O qpair was never created again if the fabric connect command for the I/O qpair timed out while the controller was being reset.

To create the I/O qpair in such a case, add a boolean pending_failover variable to the nvme_ctrlr structure. When bdev_nvme_failover() is called, if nvme_ctrlr->resetting is true, set pending_failover to true and return. Then, in _bdev_nvme_reset_complete(), if pending_failover is true, set pending_failover to false and call bdev_nvme_failover().

However, we have to be more careful. Most SPDK threads call bdev_nvme_failover() almost simultaneously for a network error. In this case, bdev_nvme_failover() must be called only once per network error. To do this, add and use another boolean variable, in_failover.

After this change, a bdev_nvme_failover() call is not lost but deferred. Hence, use -EINPROGRESS instead of -EBUSY for clarity.

Verify this change by adding a unit test case.

NOTE: A more practical workaround is to extend the timeout for the fabric connect command. While the fabric connect command is in progress, I/Os are queued even if the upper layer does not enable I/O error resiliency. However, this fix is still necessary; otherwise, connection establishment is not retried.

Signed-off-by: Shuhei Matsumoto <smatsumoto@nvidia.com>
Change-Id: Ibe346b8ae35cab5bd2bcbda1aaa12d2d9364e283
Reviewed-on: https://review.spdk.io/gerrit/c/spdk/spdk/+/18209
Reviewed-by: Aleksey Marchuk <alexeymar@nvidia.com>
Reviewed-by: Jim Harris <james.r.harris@intel.com>
Tested-by: SPDK CI Jenkins <sys_sgci@intel.com>
Community-CI: Mellanox Build Bot
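
The flag handling described in this message can be illustrated with a small standalone model. This is only a sketch, not the code from the change: struct ctrlr_model, model_failover(), model_reset_complete(), and the failovers_started counter are hypothetical names, the real bdev_nvme_failover() does far more work, and returning -EBUSY for a duplicate call during a running failover is an assumption of this sketch rather than something the message states.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical, simplified stand-in for the controller state.
 * This is NOT SPDK's struct nvme_ctrlr. */
struct ctrlr_model {
	bool resetting;         /* a reset (or failover) sequence is running */
	bool in_failover;       /* the running sequence is itself a failover */
	bool pending_failover;  /* a failover arrived during a reset and was deferred */
	int failovers_started;  /* demonstration-only counter */
};

/* Model of the failover entry point. */
static int
model_failover(struct ctrlr_model *c)
{
	if (c->resetting) {
		if (!c->in_failover) {
			/* A plain reset is running: remember the request
			 * instead of losing it. Repeated calls only set
			 * the same flag again. */
			c->pending_failover = true;
			return -EINPROGRESS;
		}
		/* A failover is already running for this error, so the
		 * duplicate call is dropped (return value is this
		 * sketch's assumption). */
		return -EBUSY;
	}

	c->resetting = true;
	c->in_failover = true;
	c->failovers_started++;
	return 0;
}

/* Model of the reset completion path. */
static void
model_reset_complete(struct ctrlr_model *c)
{
	c->resetting = false;
	c->in_failover = false;

	if (c->pending_failover) {
		/* Replay the deferred failover exactly once. */
		c->pending_failover = false;
		model_failover(c);
	}
}
```

In this model, any number of model_failover() calls made while a reset is running collapse into a single failover that starts when the reset completes, and calls made while that failover runs do not schedule another one.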