milvus-io/milvus

[FR] streamingcoord: support load-aware weighting in vchannelFair balancer (byte-rate or memory-pressure term)

Open

#49568 opened on May 7, 2026

Labels: batch import, help wanted, kind/feature, triage/accepted

Description

Summary

The vchannelFair balancer (the only registered streaming.walBalancer.balancePolicy in 2.6.10–2.6.15) computes its cost function from pchannel/vchannel count deviations only:

cost = PChannelWeight    * (pDiff)^2
     + VChannelWeight    * (vDiff)^2
     + AntiAffinityWeight * (1 - affinity)

(internal/streamingcoord/server/balancer/policy/vchannelfair/expected_layout.go @ v2.6.15)

There is no byte-rate, throughput, or memory-pressure term. When per-pchannel write rates are highly skewed — e.g. one collection has bursty writes while peers are quiet — count-balance leaves the hot pchannel pinned to a single streamingnode while peers sit idle. The policy is doing exactly what it was designed for; this request is to extend it.
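To make the gap concrete, here is a minimal Go sketch of a count-only cost in the shape quoted above. It is not the code in `expected_layout.go`: the `Weights` and `Node` types, the per-node summation against the cluster mean, and the weight values are all assumptions for illustration. The point is only that a byte-rate field can sit next to the counts without ever influencing the result.

```go
package main

import "fmt"

// Weights mirrors the three weights named in the formula above.
// The type itself is hypothetical, not a Milvus type.
type Weights struct {
	PChannel, VChannel, AntiAffinity float64
}

// Node holds per-streamingnode channel counts plus a byte rate that the
// count-only cost never reads.
type Node struct {
	PChannels, VChannels int
	BytesPerSec          float64 // ignored by CountCost
}

// CountCost sums weighted squared count deviations from the cluster mean
// (an assumed aggregation) and adds the anti-affinity term once.
func CountCost(w Weights, nodes []Node, affinity float64) float64 {
	if len(nodes) == 0 {
		return 0
	}
	var pSum, vSum float64
	for _, nd := range nodes {
		pSum += float64(nd.PChannels)
		vSum += float64(nd.VChannels)
	}
	n := float64(len(nodes))
	pMean, vMean := pSum/n, vSum/n

	cost := w.AntiAffinity * (1 - affinity)
	for _, nd := range nodes {
		pDiff := float64(nd.PChannels) - pMean
		vDiff := float64(nd.VChannels) - vMean
		cost += w.PChannel*pDiff*pDiff + w.VChannel*vDiff*vDiff
	}
	return cost
}

func main() {
	w := Weights{PChannel: 0.4, VChannel: 0.3, AntiAffinity: 0.01} // illustrative values
	even := []Node{{4, 8, 1e6}, {4, 8, 1e6}}
	skewed := []Node{{4, 8, 2e6}, {4, 8, 0}} // same counts, all writes on one node
	fmt.Println(CountCost(w, even, 1), CountCost(w, skewed, 1))
}
```

Both layouts produce the same cost because their counts match, so under this model the balancer has no reason to prefer one over the other.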

Reproduction profile (paraphrased)

The profile comes from a production cluster running Milvus 2.6.15 with embedded Woodpecker, rootCoord.dmlChannelNum=32, and eight streamingnode pods at requests=limits=8 GiB, ingesting a 4M-row batch via a delete-then-upsert pattern on a high-write-rate collection.

Observed during a campaign:

| Time | Hot pod mem | Peer pods mem | Action | Result |
|---|---|---|---|---|
| T+0 | pod-A 165% of 2 GiB request | 7/8 pods <10% util | (none yet; diagnosed) | |
| T+17m | pod-A 94% of 4 GiB after request bump | 7/8 pods 150–500 MiB | killed pod-A | pod-B becomes hot |
| T+29m | pod-B 61% of 4 GiB | 7/8 pods <10% util | watching | pattern repeats on next ramp |

Bumping streamingnode count from 5 → 8 mid-campaign produced no redistribution: walBalancer left the hot pchannel where it was because count was already even. Only manual `kubectl delete pod` on the hot pod forced reassignment, and the load just rotated to another single pod.

Workarounds attempted

  1. Increase pod count — does not help. Count-balance has nothing to redistribute.
  2. Increase per-pod memory — buys time, doesn't fix the asymmetry. Pushes OOM further out, not away.
  3. Tighten `walBalancer.triggerInterval` / `minRebalanceIntervalThreshold` / `vchannelFair.rebalanceTolerance` / `rebalanceMaxStep` / `antiAffinityWeight` — shortens the duration of a hot-spot but doesn't prevent re-concentration on the next pod, because the cost function still has nothing to weight against.
  4. `limitWriting.memProtection` — global write-deny when one pod hits ~85% mem. Not a balance fix; it just hard-denies cluster-wide writes when the asymmetry causes one pod to climb. Worse than the OOM. We disable it.
  5. `shards_num` per collection: would fix it (spreads writes across more pchannels so count-balance becomes effectively load-balance). Requires collection recreation; high operational cost.
  6. Manual operator intervention (`kubectl delete pod`) — current standing playbook. Forces reassignment but rotates the problem rather than solving it.
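As a rough illustration of why workaround 5 helps, the sketch below estimates the largest share of one collection's writes that lands on a single streamingnode for a given `shards_num`. It assumes rows hash evenly across shards and that the shards' pchannels end up on nodes round-robin; `MaxNodeShare` is a hypothetical helper, not a Milvus API.

```go
package main

import "fmt"

// MaxNodeShare returns the largest fraction of one collection's writes that
// lands on a single streamingnode when its shards are placed round-robin.
// Assumes rows hash evenly across shards (a simplification).
func MaxNodeShare(shards, nodes int) float64 {
	if shards <= 0 || nodes <= 0 {
		return 0
	}
	perNode := make([]int, nodes)
	for s := 0; s < shards; s++ {
		perNode[s%nodes]++
	}
	most := 0
	for _, c := range perNode {
		if c > most {
			most = c
		}
	}
	return float64(most) / float64(shards)
}

func main() {
	for _, shards := range []int{1, 2, 4, 8} {
		fmt.Printf("shards_num=%d -> hottest node takes %.1f%% of writes\n",
			shards, 100*MaxNodeShare(shards, 8))
	}
}
```

With a single shard one node takes everything regardless of node count, which matches the hot-pod pattern in the table above.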

Proposed enhancement

Extend the `vchannelFair` cost function with an optional load-weight term:

cost += LoadWeight * (loadDiff)^2

Where `loadDiff[node]` is the per-streamingnode deviation from the cluster mean of an existing Prometheus signal, e.g. `streamingnode_wal_append_bytes_rate` (5–60s rolling window).
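A minimal sketch of the proposed term, assuming the per-node rates have already been read from the metric and averaged over the window. `LoadDiffs` and `LoadTerm` are hypothetical names, and summing the squared deviations over nodes is an assumption about how the term would be aggregated.

```go
package main

import "fmt"

// LoadDiffs returns rate[i] - mean(rates) for each streamingnode.
func LoadDiffs(rates []float64) []float64 {
	if len(rates) == 0 {
		return nil
	}
	var sum float64
	for _, r := range rates {
		sum += r
	}
	mean := sum / float64(len(rates))
	diffs := make([]float64, len(rates))
	for i, r := range rates {
		diffs[i] = r - mean
	}
	return diffs
}

// LoadTerm returns loadWeight * sum(loadDiff^2), the additive term proposed
// above. With loadWeight == 0 it contributes nothing.
func LoadTerm(loadWeight float64, rates []float64) float64 {
	var term float64
	for _, d := range LoadDiffs(rates) {
		term += d * d
	}
	return loadWeight * term
}

func main() {
	hot := []float64{80, 0, 0, 0}       // MiB/s, one node takes every write
	spread := []float64{20, 20, 20, 20} // same total, evenly spread
	fmt.Println(LoadTerm(1, hot), LoadTerm(1, spread), LoadTerm(0, hot))
	// prints: 4800 0 0
}
```

Because the term is in squared rate units while the count terms are in squared channel counts, `LoadWeight` would also have to absorb the unit scale, or the rates would need normalising first. That choice is left open here.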

Add corresponding config keys (defaults preserve current behaviour):

streaming:
  walBalancer:
    balancePolicy:
      vchannelFair:
        loadWeight: 0.0          # default 0 = backward-compatible
        loadMetric: bytes_rate   # bytes_rate | memory | (extensible)
        loadWindow: 30s

The cost-function structure already accepts weighted squared-diff terms; this is an additive extension rather than a redesign. Operators who don't set `loadWeight` get exactly today's behaviour.

Workflow impact if implemented

  1. Eliminates manual mid-campaign rebalancing. Today an operator watches streamingnode mem skew and does `kubectl delete pod` on the hot one every 15–30 minutes during peak load. With load-weighted balance, the policy would proactively reassign the hot pchannel before mem reaches the protection threshold.
  2. Restores `memProtection` as a viable safety net. Today it's disabled because asymmetric load makes it fire as a global write-deny rather than a per-pod safety bound. Memory-aware balance would keep all pods within the protection threshold under steady load, letting `memProtection` fire only on genuine cluster-wide overload.
  3. Streamingnode horizontal scaling becomes useful again. Today adding pods doesn't help — count-balance has nothing to redistribute. Load-weighted balance lets a freshly-scaled pod absorb hot pchannels.
  4. Reduces operator paging. Hot-pod-mem OOM is currently the dominant on-call signal during heavy ingest; a load-weighted policy prevents the asymmetric climb in the first place.
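To illustrate item 3, the sketch below picks the single pchannel reassignment that most reduces the squared load deviation across nodes. It deliberately ignores the count and anti-affinity terms and is not the balancer's actual search; `BestMove`, `nodeLoads`, and `imbalance` are hypothetical helpers.

```go
package main

import "fmt"

// nodeLoads sums per-pchannel rates by owning node.
func nodeLoads(assign []int, rates []float64, nodes int) []float64 {
	loads := make([]float64, nodes)
	for ch, n := range assign {
		loads[n] += rates[ch]
	}
	return loads
}

// imbalance is the sum of squared deviations from the mean node load.
func imbalance(loads []float64) float64 {
	var sum float64
	for _, l := range loads {
		sum += l
	}
	mean := sum / float64(len(loads))
	var out float64
	for _, l := range loads {
		out += (l - mean) * (l - mean)
	}
	return out
}

// BestMove tries every (pchannel, target node) pair and returns the one move
// that lowers imbalance the most. ok is false if no move improves it.
// assign[ch] is the node index currently owning pchannel ch.
func BestMove(assign []int, rates []float64, nodes int) (ch, to int, ok bool) {
	if nodes <= 0 {
		return -1, -1, false
	}
	best := imbalance(nodeLoads(assign, rates, nodes))
	ch, to = -1, -1
	for c := range assign {
		from := assign[c]
		for n := 0; n < nodes; n++ {
			if n == from {
				continue
			}
			assign[c] = n // try the move
			if v := imbalance(nodeLoads(assign, rates, nodes)); v < best {
				best, ch, to = v, c, n
			}
			assign[c] = from // undo it
		}
	}
	return ch, to, ch >= 0
}

func main() {
	rates := []float64{60, 20, 10, 30} // MiB/s per pchannel
	assign := []int{0, 0, 0, 1}        // node 2 was just added and is empty
	ch, to, ok := BestMove(assign, rates, 3)
	fmt.Println(ch, to, ok)
}
```

In this example the hottest pchannel moves to the empty node, which is the behaviour that scaling out cannot trigger today.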

Backward compatibility

`loadWeight: 0.0` default preserves current behaviour exactly. Existing `vchannelFair` deployments that don't opt in see no change. The signal source (`streamingnode_wal_append_bytes_rate`) is already exported — no new metrics infrastructure needed.

Related issues

  • #40638 — vchannels unevenly distributed (closed for 2.6.0; introduced `vchannelFair` but didn't add load awareness)
  • #46026 — streamingnode memory leak under upsert workloads (compounds the asymmetry)
  • #48564 — `sessionDiscoverer.initDiscover` retains stale streamingnode sessions, blocking the balancer (separate but adjacent)
  • #47716 — streamingnode "freeze" / drain admin path inconsistent across components, deferred to 3.0 (so manual pchannel pin is not a viable interim lever)
