skypilot-org/skypilot

serve: autoscaler `latest_version` resets on controller restart, causing scale churn

Open

#8562 opened on Jan 13, 2026

good first issue

Description

Summary

  • After the Serve controller restarts, the autoscaler's latest_version stays at INITIAL_VERSION (1), while replicas are launched at the latest service version (e.g., 3).
  • This mismatch causes the autoscaler to repeatedly scale up and then immediately scale down the newly launched replicas, even at 0 RPS.
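
To make the loop concrete, here is a toy model of how the mismatch could produce this churn. It is only a sketch, not SkyPilot's actual autoscaler code: `ToyAutoscaler`, `decide`, and `simulate` are hypothetical names, and it assumes the autoscaler counts only replicas whose version equals its own latest_version and schedules all others for scale-down.

```python
from dataclasses import dataclass
from typing import List, Tuple

INITIAL_VERSION = 1  # value the autoscaler is stuck at after a restart (per this issue)


@dataclass
class ToyAutoscaler:
    """Hypothetical stand-in for the autoscaler; NOT SkyPilot's real class."""
    latest_version: int = INITIAL_VERSION
    min_replicas: int = 2
    overprovision: int = 1

    def decide(self, replica_versions: List[int]) -> Tuple[int, List[int]]:
        """Return (number of scale-ups, indices of replicas to scale down)."""
        target = self.min_replicas + self.overprovision
        # Assumption: only replicas at the autoscaler's own version count.
        matching = [i for i, v in enumerate(replica_versions)
                    if v == self.latest_version]
        # Assumption: every other replica is scheduled for scale-down.
        mismatched = [i for i, v in enumerate(replica_versions)
                      if v != self.latest_version]
        return max(0, target - len(matching)), mismatched


def simulate(autoscaler: ToyAutoscaler, service_version: int,
             rounds: int) -> List[Tuple[int, int]]:
    """Run a few autoscaler ticks; return (scale_ups, scale_downs) per tick."""
    replicas: List[int] = []
    history = []
    for _ in range(rounds):
        ups, downs = autoscaler.decide(replicas)
        history.append((ups, len(downs)))
        # Drop the replicas chosen for scale-down.
        replicas = [v for i, v in enumerate(replicas) if i not in downs]
        # New replicas are launched at the service's real latest version.
        replicas += [service_version] * ups
    return history


# Stale autoscaler (latest_version=1) vs. a correctly restored one (3).
print(simulate(ToyAutoscaler(latest_version=1), service_version=3, rounds=4))
print(simulate(ToyAutoscaler(latest_version=3), service_version=3, rounds=4))
```

In this model the stale autoscaler requests 3 scale-ups on every tick and scales the same 3 replicas back down on the next one, while the autoscaler with the restored version settles after the first tick.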

What we saw

  • One of our service deployments had a constant loop: autoscaler requests 3 scale‑ups (min=2 + overprovision=1), then scales down the same new replicas.
  • The loop starts right after the controller restarts.
  • DB state (example):
      - services shows current_version=1 but active_versions=[3].
      - version_specs only contains version=3.
  • Logs show:
      - Requests per second 0.0, target replicas computed as 0 or 2, but scale‑ups still requested.
      - Immediate scale‑down of the replicas that were just launched.
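
The DB symptom above can be turned into a quick consistency check. This is a rough sketch: the column names (current_version, active_versions, version) are the ones from our DB state, but the dict/list shapes and the helper `find_version_mismatch` are made up for illustration and are not part of SkyPilot.

```python
from typing import Dict, List


def find_version_mismatch(service_row: Dict, spec_versions: List[int]) -> List[str]:
    """Return human-readable problems found in one service's version state.

    service_row: hypothetical dict with 'current_version' and 'active_versions'.
    spec_versions: versions present in version_specs for this service.
    """
    problems = []
    current = service_row["current_version"]
    active = service_row["active_versions"]
    # The recorded current version should be among the active versions.
    if active and current not in active:
        problems.append(
            f"current_version={current} not in active_versions={active}")
    # The recorded current version should still have a spec row.
    if current not in spec_versions:
        problems.append(
            f"current_version={current} has no row in version_specs "
            f"(found {spec_versions})")
    return problems


# State matching the example from this issue.
for problem in find_version_mismatch(
        {"current_version": 1, "active_versions": [3]}, [3]):
    print(problem)
```

For the state we observed, both checks fire; a consistent service (current_version=3, active_versions=[3], version_specs containing 3) yields an empty list.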

Repro

  1. Launch a service and update it to version > 1.
  2. Restart the controller (or let K8s reschedule it).
  3. Observe autoscaler logs: continuous scale up/down churn, even at 0 RPS.

I can provide a suggestion/fix in a bit.
