HTTP 2 connection draining is non-graceful for low-volume listeners · envoyproxy/envoy#14350


area/http, bug, help wanted

Repository metrics

Stars: (27,997 stars)
PR merge metrics: (Avg merge 8d) (378 merged PRs in 30d)

Description

Title: HTTP 2 connection draining is non-graceful for low-volume listeners

Description:

When a listener enters a draining state, either due to a hot restart or an LDS update, it should not accept new requests but should let existing requests finish gracefully. There is a drain-time limit that, when reached, terminates any open connections non-gracefully.

For HTTP/2, putting a connection into a draining state should be equivalent to Envoy sending a GOAWAY frame on the connection. This signals that no new streams should be created on the connection, but existing streams are allowed to finish. Assuming a well-behaved client (one that respects the GOAWAY), this should mirror the desired behavior for connection draining. However, Envoy does not send a GOAWAY proactively when the drain-time begins. Instead, Envoy issues a GOAWAY after the next request made on the connection is completed. The sequence is therefore: the drain-time begins, the next request on the connection completes, and only then is the GOAWAY sent.
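As a rough sketch of what "respects the GOAWAY" means on the client side, the model below tracks one HTTP/2 connection: after a GOAWAY no new streams are opened on it, streams at or below the advertised last stream ID continue, and streams above it are dropped so they can be retried elsewhere. The class `ClientConnection` and its methods are hypothetical names for illustration; they are not taken from Envoy or from any HTTP/2 library.

```cpp
#include <cassert>
#include <cstdint>
#include <set>

// Hypothetical client-side connection state (illustration only).
class ClientConnection {
public:
  // Opens a new stream and returns its id, or 0 if a GOAWAY has been received
  // and the caller should use a different connection instead.
  std::uint32_t openStream() {
    if (goaway_received_) {
      return 0;
    }
    const std::uint32_t id = next_stream_id_;
    next_stream_id_ += 2; // client-initiated streams use odd identifiers
    active_streams_.insert(id);
    return id;
  }

  // Marks a stream as finished (response fully received).
  void closeStream(std::uint32_t id) { active_streams_.erase(id); }

  // Handles a GOAWAY: stop opening streams and forget streams the server
  // will not process (ids greater than last_stream_id).
  void onGoAway(std::uint32_t last_stream_id) {
    assert(last_stream_id <= 0x7fffffffu); // stream ids are 31 bits
    goaway_received_ = true;
    active_streams_.erase(active_streams_.upper_bound(last_stream_id),
                          active_streams_.end());
  }

  bool draining() const { return goaway_received_; }
  bool idle() const { return active_streams_.empty(); }

private:
  bool goaway_received_ = false;
  std::uint32_t next_stream_id_ = 1;
  std::set<std::uint32_t> active_streams_;
};
```

Once `draining()` and `idle()` are both true, either side can close the connection without interrupting a request, which is the graceful outcome described above.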

With an appropriately long drain-time for your traffic and a sufficiently busy listener, this delayed GOAWAY does not generally lead to issues. However, if the listener is processing a low volume of long-running requests, then it is possible to find ourselves in the following scenario:

In this scenario the request does not begin until near the end of the drain-time window. Because the GOAWAY is not sent until the request ends, the drain-time expires while the request is still in flight. The request is then interrupted as the connection is closed non-gracefully, where non-graceful is defined as 1) without a GOAWAY and 2) with in-flight requests.

An interrupted request is written to the access log with the DC (downstream connection termination) response flag and will return a 503 response. If the downstream is another Envoy instance, then the downstream's access log will record the request with the UC (upstream connection termination) flag.

I would expect Envoy to issue a GOAWAY (NO_ERROR error code) at the beginning of the drain period, without the external trigger of a request, as this already matches the desired behavior of Envoy's connection draining. I suspect the current implementation is an artifact of how connections are closed for persistent HTTP/1.
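For reference, a GOAWAY with NO_ERROR is a small fixed-size frame on the wire. The sketch below serializes one without debug data, following the frame layout in RFC 9113: a 9-byte frame header on stream 0, then a 31-bit last stream ID and a 32-bit error code. The function name `encodeGoAway` and the constants are illustrative assumptions; Envoy delegates framing to its HTTP/2 codec library, so this is not how Envoy itself builds the frame.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Values defined by the HTTP/2 specification (RFC 9113).
constexpr std::uint8_t kFrameTypeGoAway = 0x07;
constexpr std::uint32_t kErrorNoError = 0x0;
constexpr std::uint32_t kMaxStreamId = 0x7fffffffu; // 2^31 - 1

// Serializes a GOAWAY frame with no debug data:
// 9-byte frame header followed by an 8-byte payload.
std::array<std::uint8_t, 17> encodeGoAway(std::uint32_t last_stream_id,
                                          std::uint32_t error_code) {
  assert(last_stream_id <= kMaxStreamId); // the high bit is reserved
  std::array<std::uint8_t, 17> frame{};   // zero-initialized

  // Frame header: 24-bit payload length, type, flags, 31-bit stream id.
  frame[2] = 8;                // payload length = 8 (bytes 0 and 1 stay 0)
  frame[3] = kFrameTypeGoAway; // type
  frame[4] = 0;                // GOAWAY defines no flags
  // frame[5..8] stay 0: GOAWAY is always sent on stream 0 (the connection).

  // Payload: last stream id, then error code, both big-endian.
  for (std::size_t i = 0; i < 4; ++i) {
    const unsigned shift = 24 - 8 * static_cast<unsigned>(i);
    frame[9 + i] = static_cast<std::uint8_t>((last_stream_id >> shift) & 0xff);
    frame[13 + i] = static_cast<std::uint8_t>((error_code >> shift) & 0xff);
  }
  return frame;
}
```

RFC 9113 suggests that a server shutting down gracefully first send a GOAWAY with the last stream ID set to 2^31-1 and NO_ERROR, which corresponds to `encodeGoAway(kMaxStreamId, kErrorNoError)` here, and later follow up with a second GOAWAY carrying the real last stream ID.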

Repro steps:

This has been reproduced in our internal integration-test suite, but can be reproduced readily with the following:

1. Configure drain-time of 20 seconds
2. Configure parent-shutdown time of 25 seconds
3. Start Envoy
4. Create a client (either H1 or H2) and generate some traffic to ensure established connections
5. Begin a reload or perform an LDS update
6. Issue a request over the same connection that:
   - Begins 10s after the reload/LDS-update was initiated
   - Lasts for 30s (upstream service sleeps 30s before responding)
7. Observe non-graceful connection termination

All tests were done using a concurrency of 1 to ensure a single listener/connection.
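The timing in the repro steps can be checked with a small model. This is only a sketch under simplifying assumptions: all times are seconds measured from the start of the drain, there is a single connection, the connection is force-closed exactly when the drain-time expires, and the client is well-behaved (it moves new requests to a fresh connection once it has seen a GOAWAY). The names `Scenario`, `GoAwayPolicy`, `Outcome`, and `simulate` are hypothetical and do not correspond to Envoy code.

```cpp
#include <cassert>

// When the proxy tells the client to stop using the connection.
enum class GoAwayPolicy {
  AfterNextRequestCompletes, // behavior described in this issue
  AtDrainStart               // behavior the issue asks for
};

enum class Outcome {
  CompletedOnDrainingConnection,
  RoutedToNewConnection,
  Interrupted
};

struct Scenario {
  double drain_time_s;       // drain window length
  double request_start_s;    // request start, relative to drain start
  double request_duration_s; // how long the upstream takes to respond
};

Outcome simulate(const Scenario& s, GoAwayPolicy policy) {
  assert(s.drain_time_s >= 0 && s.request_start_s >= 0 &&
         s.request_duration_s >= 0);

  // The old connection is already gone once the drain-time has expired.
  if (s.request_start_s >= s.drain_time_s) {
    return Outcome::RoutedToNewConnection;
  }
  // With an early GOAWAY the client never places the request on the
  // draining connection in the first place.
  if (policy == GoAwayPolicy::AtDrainStart) {
    return Outcome::RoutedToNewConnection;
  }
  // Without it, the request runs on the draining connection and is cut off
  // if it is still in flight when the drain-time expires.
  const double request_end_s = s.request_start_s + s.request_duration_s;
  return request_end_s <= s.drain_time_s
             ? Outcome::CompletedOnDrainingConnection
             : Outcome::Interrupted;
}
```

With the values from the steps above, `Scenario{20.0, 10.0, 30.0}`, the request would end at 40s, well past the 20s drain-time, so the delayed-GOAWAY policy yields `Interrupted` while the early-GOAWAY policy yields `RoutedToNewConnection`.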

Contributor guide

Research direction: Investigate the HTTP/2 connection drain sequence in Envoy and implement sending a GOAWAY frame at the start of the drain period.
Tech stack: cpp
Domain: backend
Issue type: Bug
Difficulty: 2
Estimated time: 1-3 hours
Activity status: Active
Clarity: Clear
Prerequisites: C++, HTTP/2
Newbie friendliness: 70
