skypilot-org/skypilot
serve: autoscaler `latest_version` resets on controller restart, causing scale churn
Open
#8562 opened on Jan 13, 2026
good first issue
Description
Summary
- After the Serve controller restarts, the autoscaler's latest_version stays at INITIAL_VERSION (1), while replicas are launched at the latest service version (e.g., 3).
- This mismatch causes the autoscaler to repeatedly scale up and then immediately scale down the newly launched replicas, even at 0 RPS.
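To make the suspected mechanism concrete, here is a toy model of the loop. It is only a sketch under assumptions: `ToyAutoscaler`, `Replica`, and `simulate` are hypothetical names, not SkyPilot classes, and the model assumes that only replicas matching `latest_version` count toward the target while all others are treated as outdated and terminated.

```python
from dataclasses import dataclass

INITIAL_VERSION = 1  # value named in this report


@dataclass
class Replica:
    replica_id: int
    version: int


class ToyAutoscaler:
    """Hypothetical stand-in for the autoscaler; not SkyPilot's real class."""

    def __init__(self, min_replicas: int, overprovision: int):
        self.target = min_replicas + overprovision
        # Assumption: after a restart this is reset to INITIAL_VERSION
        # instead of being restored from the stored service state.
        self.latest_version = INITIAL_VERSION

    def decide(self, replicas):
        """Return (number of replicas to launch, replica ids to terminate)."""
        current = [r for r in replicas if r.version == self.latest_version]
        stale = [r.replica_id for r in replicas
                 if r.version != self.latest_version]
        to_launch = max(0, self.target - len(current))
        return to_launch, stale


def simulate(ticks: int, service_version: int):
    """Run the toy loop; new replicas always launch at service_version."""
    autoscaler = ToyAutoscaler(min_replicas=2, overprovision=1)
    replicas, next_id, history = [], 0, []
    for _ in range(ticks):
        to_launch, to_terminate = autoscaler.decide(replicas)
        replicas = [r for r in replicas if r.replica_id not in to_terminate]
        for _ in range(to_launch):
            replicas.append(Replica(next_id, service_version))
            next_id += 1
        history.append((to_launch, len(to_terminate)))
    return history


# Service is at version 3 but the autoscaler believes the latest is 1:
print(simulate(3, service_version=3))  # [(3, 0), (3, 3), (3, 3)]
# Versions agree, so the loop settles after the first tick:
print(simulate(3, service_version=1))  # [(3, 0), (0, 0), (0, 0)]
```

In this model the churn does not depend on traffic at all, which matches the observation that it happens even at 0 RPS.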
What we saw
- One of our service deployments was stuck in a constant loop: the autoscaler requests 3 scale‑ups (min=2 + overprovision=1), then scales down the same new replicas.
- The loop starts right after the controller restarts.
- DB state (example):
  - services shows current_version=1 but active_versions=[3].
  - version_specs only contains version=3.
- Logs show:
  - Requests per second 0.0, target replicas computed as 0 or 2, but scale‑ups still requested.
  - Immediate scale‑down of the replicas that were just launched.
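For anyone checking whether their deployment is in the same state, the inconsistency in the DB values above can be expressed as a small check. This is a sketch only: `find_version_mismatch` is a hypothetical helper, its arguments simply mirror the fields quoted above, and reading those values out of the controller's database is not shown.

```python
def find_version_mismatch(current_version, active_versions, spec_versions):
    """Return a list of problems found in the stored version state.

    Hypothetical helper: the arguments mirror current_version,
    active_versions, and the versions present in version_specs.
    """
    problems = []
    if active_versions and current_version < max(active_versions):
        problems.append(
            f"current_version={current_version} is older than "
            f"active version {max(active_versions)}")
    if current_version not in spec_versions:
        problems.append(
            f"no version spec stored for current_version={current_version}")
    return problems


# Values from the example state in this report:
for problem in find_version_mismatch(1, [3], [3]):
    print(problem)
```

An empty result means the stored versions agree with each other; it does not by itself prove the in-memory autoscaler state is correct.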
Repro
- Launch a service and update it to a version > 1.
- Restart the controller (or let K8s reschedule it).
- Observe autoscaler logs: continuous scale up/down churn, even at 0 RPS.
I can provide a suggestion/fix in a bit.