skypilot-org/skypilot

serve: autoscaler `latest_version` resets on controller restart, causing scale churn

Open

Opened on 13 Jan 2026

Labels: good first issue, good starter issues

Description

Summary

  • After the Serve controller restarts, the autoscaler's `latest_version` stays at `INITIAL_VERSION` (1), while replicas are launched at the latest service version (e.g., 3).
  • Because of this mismatch, the autoscaler repeatedly scales up and then immediately scales down the newly launched replicas, even at 0 RPS.
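To make the suspected mechanism concrete, here is a toy model of the loop. It is only a sketch under an assumption I have not verified against the autoscaler source: that replicas whose version differs from `latest_version` are treated as outdated and do not count toward the target. `ToyAutoscaler` and `simulate` are hypothetical names, not SkyPilot APIs.

```python
from dataclasses import dataclass
from typing import Dict, List

INITIAL_VERSION = 1  # value the autoscaler is reported to hold after a restart


@dataclass
class ToyAutoscaler:
    """Hypothetical stand-in for the autoscaler; not SkyPilot's real class."""
    target_replicas: int
    latest_version: int = INITIAL_VERSION

    def decide(self, replica_versions: List[int]) -> Dict[str, int]:
        # Assumption: only replicas at latest_version count toward the target;
        # replicas at any other version are considered outdated.
        matching = sum(1 for v in replica_versions if v == self.latest_version)
        outdated = len(replica_versions) - matching
        return {
            "scale_up": max(0, self.target_replicas - matching),
            "scale_down": outdated,
        }


def simulate(autoscaler: ToyAutoscaler, service_version: int,
             rounds: int) -> List[Dict[str, int]]:
    """Run a few autoscaler rounds; new replicas launch at service_version."""
    replicas: List[int] = []
    history = []
    for _ in range(rounds):
        decision = autoscaler.decide(replicas)
        history.append(decision)
        # Outdated replicas are terminated ...
        replicas = [v for v in replicas if v == autoscaler.latest_version]
        # ... and replacements come up at the real service version.
        replicas += [service_version] * decision["scale_up"]
    return history


if __name__ == "__main__":
    # Stale latest_version=1 vs. service version 3: churn in every round.
    print(simulate(ToyAutoscaler(target_replicas=3), service_version=3, rounds=3))
    # latest_version restored to 3: converges after the first round.
    print(simulate(ToyAutoscaler(3, latest_version=3), service_version=3, rounds=3))
```

In this model the stale case never converges: every round requests 3 new replicas and marks the previous 3 as outdated, which matches the pattern in our logs. With `latest_version` equal to the service version, the second round requests nothing.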

What we saw

  • One of our service deployments was stuck in a constant loop: the autoscaler requests 3 scale‑ups (min=2 + overprovision=1), then scales down the same new replicas.
  • The loop starts right after the controller restarts.
  • DB state (example):
      - `services` shows current_version=1 but active_versions=[3].
      - `version_specs` only contains version=3.
  • Logs show:
      - Requests per second 0.0, target replicas computed as 0 or 2, but scale‑ups still requested.
      - Immediate scale‑down of the replicas that were just launched.
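For anyone who wants to check whether their own deployment is in the same state, the DB observation above can be expressed as a small consistency check. This is a sketch only: the table names `services` and `version_specs` and the fields `current_version`, `active_versions`, and `version` come from what we saw, but the `name` and `service_name` columns and the JSON encoding of `active_versions` are assumptions made for illustration, and the real schema may differ. It runs against an in-memory SQLite database so it does not touch any real state.

```python
import json
import sqlite3


def demo_connection() -> sqlite3.Connection:
    """Build an in-memory DB mirroring the state described above (assumed schema)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE services (
            name TEXT PRIMARY KEY,
            current_version INTEGER,
            active_versions TEXT  -- assumed to be a JSON list
        );
        CREATE TABLE version_specs (service_name TEXT, version INTEGER);
        """
    )
    conn.execute("INSERT INTO services VALUES ('my-svc', 1, '[3]')")
    conn.execute("INSERT INTO version_specs VALUES ('my-svc', 3)")
    conn.commit()
    return conn


def find_stale_versions(conn: sqlite3.Connection):
    """Return services whose current_version disagrees with the other tables."""
    rows = conn.execute(
        "SELECT s.name, s.current_version, s.active_versions, MAX(v.version) "
        "FROM services s "
        "LEFT JOIN version_specs v ON v.service_name = s.name "
        "GROUP BY s.name, s.current_version, s.active_versions"
    ).fetchall()
    stale = []
    for name, current, active_json, max_spec in rows:
        active = json.loads(active_json)
        not_active = bool(active) and current not in active
        behind_specs = max_spec is not None and current < max_spec
        if not_active or behind_specs:
            stale.append((name, current, active, max_spec))
    return stale


if __name__ == "__main__":
    for row in find_stale_versions(demo_connection()):
        print(row)
```

A service is flagged when its `current_version` is missing from a non-empty `active_versions` list, or is lower than the highest version in `version_specs`. Both conditions hold for the state we observed.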

Repro

  1. Launch a service and update it to version > 1.
  2. Restart the controller (or let K8s reschedule it).
  3. Observe autoscaler logs: continuous scale up/down churn, even at 0 RPS.

I can provide a suggestion/fix in a bit.
