skypilot-org/skypilot

serve: autoscaler `latest_version` resets on controller restart, causing scale churn

Open

#8,562 opened on January 13, 2026

Labels: good first issue, good starter issues

Description

Summary

  • After the Serve controller restarts, the autoscaler's `latest_version` stays at `INITIAL_VERSION` (1), while replicas are launched at the latest service version (e.g., 3).
  • This mismatch causes the autoscaler to repeatedly scale up and then immediately scale down the newly launched replicas, even at 0 RPS.
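
To make the suspected mechanism concrete, here is a toy simulation. It is not SkyPilot code: `Replica`, `plan`, and `simulate` are hypothetical names, and the decision rule (only replicas at the autoscaler's `latest_version` count toward the target, everything else is treated as outdated) is my assumption about what is going on, not a confirmed reading of the autoscaler.

```python
from dataclasses import dataclass

INITIAL_VERSION = 1  # value taken from this issue; everything else below is hypothetical


@dataclass
class Replica:
    replica_id: int
    version: int


def plan(replicas, latest_version, target):
    """Assumed rule: only replicas at latest_version count toward the target."""
    current = [r for r in replicas if r.version == latest_version]
    outdated = [r for r in replicas if r.version != latest_version]
    scale_up = max(0, target - len(current))
    scale_down = [r.replica_id for r in outdated]
    return scale_up, scale_down


def simulate(autoscaler_version, service_version, target, rounds):
    """Run a few autoscaler ticks; new replicas always launch at service_version."""
    replicas, next_id, history = [], 1, []
    for _ in range(rounds):
        up, down = plan(replicas, autoscaler_version, target)
        history.append((up, len(down)))  # (scale-ups requested, scale-downs requested)
        replicas = [r for r in replicas if r.replica_id not in down]
        for _ in range(up):
            replicas.append(Replica(next_id, service_version))
            next_id += 1
    return history


# Stale autoscaler version after restart: every tick launches 3 and removes 3.
print(simulate(INITIAL_VERSION, 3, target=3, rounds=3))
# Autoscaler version matches the service version: one launch, then steady state.
print(simulate(3, 3, target=3, rounds=3))
```

In this model the churn disappears as soon as the autoscaler's version matches the version the replicas are launched at, which is consistent with the loop starting only after a restart.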

What we saw

  • One of our service deployments had a constant loop: autoscaler requests 3 scale‑ups (min=2 + overprovision=1), then scales down the same new replicas.
  • The loop starts right after the controller restarts.
  • DB state (example):
      - services shows current_version=1 but active_versions=[3].
      - version_specs only contains version=3.
  • Logs show:
      - Requests per second 0.0, target replicas computed as 0 or 2, but scale‑ups still requested.
      - Immediate scale‑down of the replicas that were just launched.
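
For anyone checking whether their deployment is in the same state, a small consistency check over the values above might look like this. `check_state` is a hypothetical helper; it takes plain values already read from the DB and does not assume anything about how SkyPilot stores or exposes them.

```python
def check_state(current_version, active_versions, spec_versions):
    """Return human-readable inconsistencies between the recorded and live versions."""
    problems = []
    if current_version not in spec_versions:
        problems.append(
            f"current_version={current_version} has no entry in version_specs")
    stale = sorted(set(active_versions) - {current_version})
    if stale:
        problems.append(
            f"active_versions {stale} differ from current_version={current_version}")
    return problems


# Values from the example DB state in this issue.
for problem in check_state(current_version=1, active_versions=[3], spec_versions=[3]):
    print(problem)
```

A consistent state (for example `current_version=3`, `active_versions=[3]`, `version_specs` containing 3) returns an empty list.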

Repro

  1. Launch a service and update it to version > 1.
  2. Restart the controller (or let K8s reschedule it).
  3. Observe autoscaler logs: continuous scale up/down churn, even at 0 RPS.

I can provide a suggestion/fix in a bit.
