Support more robust retries · nginx/nginx#1263

Repository metrics

Stars: (30,331 stars)
PR merge metrics: (Avg merge 27d 8h) (17 merged PRs in 30d)

Description

Feature Overview

In NGINX Gateway Fabric, we're looking to support Gateway API HTTPRoute Retries.

The API defines the following fields relating to retries:

codes defines the HTTP response status codes for which a backend request should be retried.
attempts specifies the maximum number of times an individual request from the gateway to a backend should be retried. If the maximum number of retries has been attempted without a successful response from the backend, the Gateway MUST return an error.
backoff specifies the minimum duration a Gateway should wait between retry attempts. An implementation MAY use an exponential or alternative backoff strategy for subsequent retry attempts, MAY cap the maximum backoff duration to some amount greater than the specified minimum, and MAY add arbitrary jitter to stagger requests, as long as unsuccessful backend requests are not retried before the configured minimum duration.

backend = upstream
gateway/implementation = NGINX
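The backoff requirement above leaves the schedule open, so here is a minimal sketch of one schedule that would satisfy it. The function name, the parameters (factor, cap_s, jitter_s), and the choice of exponential growth with additive jitter are all assumptions made for illustration; the spec only requires that no retry happens sooner than the configured minimum.

```python
import random


def backoff_delays(min_backoff_s, retries, factor=2.0, cap_s=None,
                   jitter_s=0.0, rng=random):
    """Return one wait time (in seconds) per retry attempt.

    Hypothetical helper: exponential growth, optional cap, additive jitter.
    Every returned delay is >= min_backoff_s.
    """
    delays = []
    for i in range(retries):
        # Exponential growth starting at the configured minimum.
        delay = min_backoff_s * (factor ** i)
        if cap_s is not None:
            # The cap may never pull a delay below the configured minimum.
            delay = min(delay, max(cap_s, min_backoff_s))
        # Jitter is only ever added, so the minimum is still respected.
        delay += rng.uniform(0.0, jitter_s)
        delays.append(delay)
    return delays


if __name__ == "__main__":
    print(backoff_delays(0.5, 4, cap_s=1.5))
```

Because jitter is applied after the cap, a delay can exceed cap_s by up to jitter_s; an implementation that wants a hard upper bound would clamp once more after adding jitter.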

This feature has a relationship with HTTPRoute Timeouts. See the Gateway API reference for more details.

The existing proxy_next_upstream directives come close to supporting this feature, but they do not quite do what we need.

The idea is that NGINX Gateway Fabric would take the values a user specifies in these API fields and convert them into an NGINX configuration that satisfies the requirements set by the API specification.
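To make the conversion concrete, here is a rough sketch of how far the existing directives could take it. The function name and return shape are hypothetical, not NGINX Gateway Fabric code. The set of status codes reflects the http_NNN parameters that proxy_next_upstream accepts; the "attempts + 1" mapping assumes proxy_next_upstream_tries counts the initial attempt as well as the retries. There is no directive to map backoff to, so it is left out.

```python
# Status codes that proxy_next_upstream can match via its http_NNN parameters.
SUPPORTED_CODES = {403, 404, 429, 500, 502, 503, 504}


def render_retry_directives(codes, attempts):
    """Hypothetical helper: return (config_lines, unsupported_codes)."""
    unique = set(codes)
    supported = sorted(c for c in unique if c in SUPPORTED_CODES)
    unsupported = sorted(c for c in unique if c not in SUPPORTED_CODES)

    lines = []
    if supported:
        params = " ".join(f"http_{c}" for c in supported)
        lines.append(f"proxy_next_upstream {params};")
        # Assumption: the tries limit includes the first attempt,
        # so N retries translate to N + 1 tries.
        lines.append(f"proxy_next_upstream_tries {attempts + 1};")
    return lines, unsupported


if __name__ == "__main__":
    lines, unsupported = render_retry_directives([503, 500, 418], 2)
    print("\n".join(lines))
    print("cannot be expressed:", unsupported)
```

Any code that ends up in the unsupported list, along with the backoff value, is part of the gap this issue describes.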

Alternatives Considered

We originally looked at how to use proxy_next_upstream for this; however, I don't believe it behaves quite how we want. It passes the request to the next server and then errors out if none are left. If a user wants 5 attempts but only has 1 upstream server, this won't work. A possible workaround is to duplicate the same server 5 times in the upstream, but this seems a bit hacky. The user may also have 3 servers, so which one would we duplicate? It's not a clean solution.
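The shortfall can be illustrated with a deliberately simplified model. It assumes that each upstream server is tried at most once per request, which is how the behavior is described above; the function name is hypothetical and the model ignores everything else NGINX does when selecting a peer.

```python
def retries_available(num_servers, requested_retries):
    """Simplified model: retries actually possible for one request.

    Assumes each upstream server is tried at most once, so after the
    initial attempt only (num_servers - 1) servers remain for retries.
    """
    remaining_servers = max(num_servers - 1, 0)
    return min(requested_retries, remaining_servers)


if __name__ == "__main__":
    for servers in (1, 3, 6):
        got = retries_available(servers, 5)
        print(f"{servers} server(s), 5 retries requested -> {got} possible")
```

Under this model, a single-server upstream gets no retries at all, and a three-server upstream gets two, regardless of how many the user asked for.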

Also, proxy_next_upstream_timeout appears to limit the total time across all tries rather than the time per retry. In addition, proxy_next_upstream doesn't support exponential backoff or jitter, which aren't technically required by the spec but may be nice to support anyway.

Additional Context

API specification: https://gateway-api.sigs.k8s.io/reference/spec/#httprouteretry
API design/proposal: https://gateway-api.sigs.k8s.io/geps/gep-1731/

Contributor guide

Research direction: Investigate existing NGINX retry mechanisms such as proxy_next_upstream and explore how to implement exponential backoff and jitter. Review the Gateway API specification for retries and design a mapping from its fields to NGINX configuration.
Tech stack: C
Domain: backend, API
Issue type: Feature
Difficulty: 3
Estimated time: 3-5 days
Activity status: Active
Clarity: Clear
Prerequisites: C, NGINX internals
Newbie friendliness: 30