Commit 365db9ee authored by Ben Walker, committed by Tomasz Zawadzki

sock/posix: Deal with hung I/O with MSG_ZEROCOPY and interrupt suppression

When all of the following conditions are met:
- non-blocking socket
- zero copy is enabled
- interrupts are suppressed (i.e. busy polling)
- NIC tx queue is full at the time sendmsg() is called
- epoll_wait sees there is already an EPOLLIN event
then we can get into a situation where data we've sent is queued
up in the kernel network stack, but interrupts have been suppressed
because other traffic is flowing. This makes the kernel miss the
signal to flush the software tx queue. If no EPOLLIN event were
already pending, epoll_wait would have been sufficient to kick the
system out of this state. But when all of these conditions align,
the I/O hangs.

We deal with this by detecting the scenario and calling poll(), which
will force the kernel to issue the pending transmits.
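
The workaround can be sketched in isolation as follows. This is a minimal
illustration, not the SPDK code: struct demo_sock and demo_kick_pending()
are hypothetical names standing in for the posix sock fields and the loop
added in the diff below. It assumes, as this commit message states, that a
zero-timeout poll() is enough to make the kernel issue the pending
transmits; the sketch only shows the call pattern and the placement_id
check that skips consecutive sockets sharing a hardware queue.

```c
#include <assert.h>
#include <poll.h>
#include <stddef.h>

/* Hypothetical stand-in for the posix sock fields used by the workaround. */
struct demo_sock {
	int fd;
	int zcopy;        /* non-zero if MSG_ZEROCOPY is enabled */
	int placement_id; /* negative if unknown */
};

/* Issue a zero-timeout poll() on each zero-copy socket whose placement_id
 * differs from the one polled just before it.
 * Returns the number of poll() calls made. */
static int
demo_kick_pending(const struct demo_sock *socks, size_t n)
{
	int last_placement_id = -1;
	int calls = 0;
	size_t i;

	assert(socks != NULL || n == 0);

	for (i = 0; i < n; i++) {
		const struct demo_sock *s = &socks[i];

		if (s->zcopy && s->placement_id >= 0 &&
		    s->placement_id != last_placement_id) {
			struct pollfd pfd = { .fd = s->fd, .events = POLLIN | POLLERR };

			/* Timeout 0 returns immediately; the result is not used. */
			(void)poll(&pfd, 1, 0);
			last_placement_id = s->placement_id;
			calls++;
		}
	}

	return calls;
}
```

Only the previous placement_id is remembered, so a socket whose id was
polled earlier but not immediately before is polled again. That keeps the
check cheap at the cost of an occasional redundant poll() call.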

Change-Id: Ifb247159b7de16c8fc72a90f0333f5b421c8bd07
Signed-off-by: Ben Walker <benjamin.walker@intel.com>
Reviewed-on: https://review.spdk.io/gerrit/c/spdk/spdk/+/6750
Tested-by: SPDK CI Jenkins <sys_sgci@intel.com>
Reviewed-by: Jim Harris <james.r.harris@intel.com>
Reviewed-by: Tomasz Zawadzki <tomasz.zawadzki@intel.com>
Community-CI: Mellanox Build Bot
parent 97b0c5d3
+45 −0
@@ -1254,6 +1254,51 @@ posix_sock_group_impl_poll(struct spdk_sock_group_impl *_group, int max_events,
	struct timespec ts = {0};
#endif

#ifdef SPDK_ZEROCOPY
	/* When all of the following conditions are met
	 * - non-blocking socket
	 * - zero copy is enabled
	 * - interrupts suppressed (i.e. busy polling)
	 * - the NIC tx queue is full at the time sendmsg() is called
	 * - epoll_wait determines there is an EPOLLIN event for the socket
	 * then we can get into a situation where data we've sent is queued
	 * up in the kernel network stack, but interrupts have been suppressed
	 * because other traffic is flowing so the kernel misses the signal
	 * to flush the software tx queue. If there wasn't incoming data
	 * pending on the socket, then epoll_wait would have been sufficient
	 * to kick off the send operation, but since there is a pending event
	 * epoll_wait does not trigger the necessary operation.
	 *
	 * We deal with this by checking for all of the above conditions and
	 * additionally looking for EPOLLIN events that were not consumed from
	 * the last poll loop. We take this to mean that the upper layer is
	 * unable to consume them because it is blocked waiting for resources
	 * to free up, and those resources are most likely freed in response
	 * to a pending asynchronous write completing.
	 *
	 * Additionally, sockets that have the same placement_id actually share
	 * an underlying hardware queue. That means polling one of them is
	 * equivalent to polling all of them. As a quick mechanism to avoid
	 * making extra poll() calls, stash the last placement_id during the loop
	 * and only poll if it's not the same. The overwhelmingly common case
	 * is that all sockets in this list have the same placement_id because
	 * SPDK is intentionally grouping sockets by that value, so even
	 * though this won't stop all extra calls to poll(), it's very fast
	 * and will catch all of them in practice.
	 */
	int last_placement_id = -1;

	TAILQ_FOREACH(psock, &group->pending_events, link) {
		if (psock->zcopy && psock->placement_id >= 0 &&
		    psock->placement_id != last_placement_id) {
			struct pollfd pfd = {psock->fd, POLLIN | POLLERR, 0};

			poll(&pfd, 1, 0);
			last_placement_id = psock->placement_id;
		}
	}
#endif

	/* This must be a TAILQ_FOREACH_SAFE because while flushing,
	 * a completion callback could remove the sock from the
	 * group. */