Description
Feature Overview
In NGINX Gateway Fabric, we're looking to support Gateway API HTTPRoute Retries.
The API defines the following fields relating to retries:
- codes defines the HTTP response status codes for which a backend request should be retried.
- attempts specifies the maximum number of times an individual request from the gateway to a backend should be retried. If the maximum number of retries has been attempted without a successful response from the backend, the Gateway MUST return an error.
- backoff specifies the minimum duration a Gateway should wait between retry attempts. An implementation MAY use an exponential or alternative backoff strategy for subsequent retry attempts, MAY cap the maximum backoff duration to some amount greater than the specified minimum, and MAY add arbitrary jitter to stagger requests, as long as unsuccessful backend requests are not retried before the configured minimum duration.
In NGINX terms: backend = upstream, and gateway/implementation = NGINX.
This feature has a relationship with HTTPRoute Timeouts. See the Gateway API reference for more details.
The existing proxy_next_upstream directives seem to get close to supporting this feature, but they are not quite what we need.
The idea is that NGINX Gateway Fabric would take the values specified by a user in these API fields and convert them to an NGINX configuration that satisfies the requirements set by the API specification.
Alternatives Considered
We originally looked at how to use proxy_next_upstream for this, but I don't believe it behaves quite how we want. It retries the request on the next server in the upstream and errors out once none are left. If a user wants 5 attempts but has only 1 upstream server, this won't work. A possible workaround is to duplicate the same server 5 times in the upstream, but this seems hacky. The user may also have 3 servers, so which one would we duplicate? It's not a clean solution.
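The mismatch can be sketched with a small Go model. This is a deliberately simplified model based on the behavior described above, assuming each upstream server is tried at most once per request and that a non-zero proxy_next_upstream_tries caps the total; the `maxTries` helper is hypothetical and ignores details such as backup servers.

```go
package main

import "fmt"

// maxTries models, in simplified form, how many times NGINX can pass one
// request upstream when every server fails: each server is tried at most
// once, and a non-zero proxy_next_upstream_tries value caps the total.
func maxTries(servers, nextUpstreamTries int) int {
	if nextUpstreamTries > 0 && nextUpstreamTries < servers {
		return nextUpstreamTries
	}
	return servers
}

func main() {
	const wantRetries = 5 // the user's "attempts" value
	for _, servers := range []int{1, 3, 6} {
		retries := maxTries(servers, 0) - 1 // the first try is not a retry
		fmt.Printf("servers=%d retries=%d satisfies=%t\n",
			servers, retries, retries >= wantRetries)
	}
	// Output:
	// servers=1 retries=0 satisfies=false
	// servers=3 retries=2 satisfies=false
	// servers=6 retries=5 satisfies=true
}
```

Under this model the number of retries is tied to the number of servers rather than to the user's `attempts` value, which is why duplicating servers would be needed to reach 5 retries.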
Also, proxy_next_upstream_timeout appears to be a total timeout across all tries rather than a per-retry timeout. The proxy_next_upstream mechanism also doesn't support exponential backoff or jitter. The spec doesn't strictly require these, but they may be nice to support anyway.
Additional Context
API specification: https://gateway-api.sigs.k8s.io/reference/spec/#httprouteretry
API design/proposal: https://gateway-api.sigs.k8s.io/geps/gep-1731/