The current method for retrying CONNECTs is not reliable; in fact, we have been seeing a lot of CI failures around the CONNECTs timing out.

We can actually make this much more reliable, and also way simpler. Each sgroup already has a TAILQ of requests that will get retried when the sgroup gets resumed. So just put the CONNECT request on that TAILQ, and modify the resume logic to send these commands down the exec_fabrics() path.

We can remove all of the timeout-related code for this now too. We know that these CONNECTs will get retried when the subsystem is in RESUMING state.

Tested this using a local patch that would inject a 10ms delay (using a timed poller) before starting a RESUME, with debug prints showing when a CONNECT was queued, and running the connect_stress.sh in a loop on my test system.

Fixes issue #3095.

Signed-off-by: Jim Harris <jim.harris@samsung.com>
Change-Id: I06ae83399b91e63b590f88bf420e3cba2149223a
Reviewed-on: https://review.spdk.io/gerrit/c/spdk/spdk/+/21392
Community-CI: Mellanox Build Bot
Tested-by: SPDK CI Jenkins <sys_sgci@intel.com>
Reviewed-by: Konrad Sztyber <konrad.sztyber@intel.com>
Reviewed-by: Tomasz Zawadzki <tomasz.zawadzki@intel.com>
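
For readers unfamiliar with the pattern, the queue-and-replay idea described in the message can be sketched roughly as follows. This is a minimal illustration, not the actual SPDK code: the types and the helpers `sgroup_init()`, `submit()`, `dispatch()`, `resume()` and `exec_io()` are hypothetical stand-ins, and only the `exec_fabrics()` name and the use of a TAILQ come from the commit message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/queue.h>

/* Hypothetical, simplified types; these are NOT SPDK's real structures. */
enum req_kind { REQ_FABRICS_CONNECT, REQ_IO };

struct request {
	enum req_kind kind;
	bool executed;
	TAILQ_ENTRY(request) link;
};

struct sgroup {
	bool paused;
	TAILQ_HEAD(, request) queued;	/* requests waiting for resume */
	int fabrics_executed;
	int io_executed;
};

static void sgroup_init(struct sgroup *sg)
{
	sg->paused = true;
	sg->fabrics_executed = 0;
	sg->io_executed = 0;
	TAILQ_INIT(&sg->queued);
}

/* Stand-in for the fabrics command path named in the commit message. */
static void exec_fabrics(struct sgroup *sg, struct request *req)
{
	req->executed = true;
	sg->fabrics_executed++;
}

/* Stand-in for the regular I/O path (hypothetical name). */
static void exec_io(struct sgroup *sg, struct request *req)
{
	req->executed = true;
	sg->io_executed++;
}

/* Route a request to the path matching its kind. */
static void dispatch(struct sgroup *sg, struct request *req)
{
	assert(req != NULL);
	if (req->kind == REQ_FABRICS_CONNECT) {
		exec_fabrics(sg, req);
	} else {
		exec_io(sg, req);
	}
}

/* While the sgroup is paused, park the request on the TAILQ instead of
 * arming a timeout; otherwise execute it right away. */
static void submit(struct sgroup *sg, struct request *req)
{
	if (sg->paused) {
		TAILQ_INSERT_TAIL(&sg->queued, req, link);
		return;
	}
	dispatch(sg, req);
}

/* On resume, drain the queue in FIFO order; queued CONNECTs go down
 * the fabrics path, everything else down the I/O path. */
static void resume(struct sgroup *sg)
{
	struct request *req;

	sg->paused = false;
	while ((req = TAILQ_FIRST(&sg->queued)) != NULL) {
		TAILQ_REMOVE(&sg->queued, req, link);
		dispatch(sg, req);
	}
}
```

In this sketch no timer is needed: a CONNECT submitted while the group is paused is guaranteed to be replayed by `resume()`, which is the property the message relies on to drop the timeout code.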