Description
Description Overload Manager is really helpful in protecting Envoy from resource exhaustion. However, tuning the triggers (e.g., heap usage, system load) and actions (e.g., stop_accepting_requests, disable_http_keepalive) in a live production environment carries significant risk. An overly aggressive configuration can inadvertently shed legitimate traffic or cause cascading failures before the operator realizes the thresholds are misconfigured.
Currently, there is no native way to validate an Overload Manager configuration against real production traffic patterns without actually enforcing the shedding logic.
Proposal Adding a Dry-Run Mode to the Overload Manager bootstrap configuration.
When enabled, this mode would allow the Overload Manager to evaluate all triggers and actions as usual, but prevent the actual enforcement of those actions (e.g., connections would not be rejected, keepalives would not be disabled). Instead, it would emit specific telemetry indicating that an action would have been taken.
Configuration Changes We would extend the envoy.config.overload.v3.OverloadManager bootstrap configuration.
Protocol Buffers
message OverloadManager {
// ... existing fields ...
// If true, the Overload Manager will track resource usage and triggers,
// but will not enforce any actions. Instead, it will emit stats indicating
// which actions would have been active.
bool dry_run = 3;
}
Desired Behavior
Signal Evaluation: The Overload Manager continues to monitor resource monitors (e.g., fixed_heap, min_memory) and calculate saturation.
Action Bypass: When a trigger threshold is breached:
Standard Mode: The associated action (e.g., envoy.overload_actions.stop_accepting_connections) transitions to an active state, and traffic is shed.
Dry-Run Mode: The action state is calculated effectively as "active" for observability purposes, but the check within the connection lifecycle ignores the signal and proceeds to process the traffic.
Observability:
(Optional) Emit shadow metrics, such as overload.dry_run.action..active or overload.dry_run.dropped_count. (Optional) Log a warning at debug or info level indicating "Overload Action [X] triggered (Dry-Run: suppressed)."
This feature enables "Safety First" deployments. Operators can:
- Deploy a new Overload Manager configuration to production with dry_run: true.
- Monitor metrics to see if the thresholds (e.g., 90% heap) are flapping or triggering too frequently during normal bursts.
- Tune the configuration based on real-world data.
- Flip dry_run to false via a config update once confident.
Alternatives Considered
- Running a sidecar Envoy with the config just to test behavior. This is operationally expensive and doesn't capture the exact resource pressure of the primary container.
- Synthetic Load Testing rarely match the unpredictability of production traffic spikes.
I am happy to work on a PR for this if the maintainers agree with the design direction.