[FR] streamingcoord: support load-aware weighting in vchannelFair balancer (byte-rate or memory-pressure term)
#49568 opened on May 7, 2026
Description
Summary
The `vchannelFair` balancer (the only registered `streaming.walBalancer.balancePolicy` in 2.6.10–2.6.15) computes its cost function from pchannel/vchannel count deviations plus an anti-affinity term, and nothing else:
    cost = PChannelWeight * (pDiff)^2
         + VChannelWeight * (vDiff)^2
         + AntiAffinityWeight * (1 - affinity)
(internal/streamingcoord/server/balancer/policy/vchannelfair/expected_layout.go @ v2.6.15)
There is no byte-rate, throughput, or memory-pressure term. When per-pchannel write rates are highly skewed — e.g. one collection has bursty writes while peers are quiet — count-balance leaves the hot pchannel pinned to a single streamingnode while peers sit idle. The policy is doing exactly what it was designed for; this request is to extend it.
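To make the gap concrete, here is a minimal Go sketch of a count-only cost. It is not the code in `expected_layout.go`: the `nodeLayout` type, the `countOnlyCost` function, the summing over nodes, and the weight values are assumptions for illustration, and the anti-affinity term is left out for brevity. It only shows that a count-balanced layout scores as perfectly balanced no matter how skewed the byte rates are.

```go
package main

import "fmt"

// nodeLayout is a hypothetical, simplified view of one streamingnode.
type nodeLayout struct {
	PChannels int     // pchannels assigned to the node
	VChannels int     // vchannels assigned to the node
	BytesRate float64 // observed append rate in bytes/s; never read by the cost below
}

// countOnlyCost sums the weighted squared deviations of per-node channel
// counts from the cluster mean. The anti-affinity term is omitted.
func countOnlyCost(nodes []nodeLayout, pWeight, vWeight float64) float64 {
	if len(nodes) == 0 {
		return 0
	}
	var pSum, vSum float64
	for _, n := range nodes {
		pSum += float64(n.PChannels)
		vSum += float64(n.VChannels)
	}
	pMean := pSum / float64(len(nodes))
	vMean := vSum / float64(len(nodes))

	var cost float64
	for _, n := range nodes {
		pDiff := float64(n.PChannels) - pMean
		vDiff := float64(n.VChannels) - vMean
		cost += pWeight*pDiff*pDiff + vWeight*vDiff*vDiff
	}
	return cost
}

func main() {
	// Eight nodes with 4 pchannels each (32 total), one node carrying a hot pchannel.
	nodes := make([]nodeLayout, 8)
	for i := range nodes {
		nodes[i] = nodeLayout{PChannels: 4, VChannels: 4, BytesRate: 1e5}
	}
	nodes[0].BytesRate = 5e7 // hot node

	// Weights are arbitrary illustrative values.
	fmt.Println(countOnlyCost(nodes, 0.4, 0.3)) // prints 0: the skew is invisible
}
```

Because `BytesRate` never enters the computation, no reassignment can lower the cost, which matches the "nothing to redistribute" behaviour described here.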
Reproduction profile (paraphrased)
A production cluster running Milvus 2.6.15 with embedded Woodpecker, rootCoord.dmlChannelNum=32, eight streamingnode pods at requests=limits=8 GiB, ingesting a 4M-row batch via a delete-then-upsert pattern on a high-write-rate collection.
Observed during a campaign:
| Time | Hot pod mem | Peer pods mem | Action | Result |
|---|---|---|---|---|
| T+0 | pod-A 165% of 2 GiB request | 7/8 pods <10% util | (none yet; diagnosed) | - |
| T+17m | pod-A 94% of 4 GiB after request bump | 7/8 pods 150–500 MiB | killed pod-A | pod-B becomes hot |
| T+29m | pod-B 61% of 4 GiB | 7/8 pods <10% util | watching | pattern repeats on next ramp |
Bumping streamingnode count from 5 → 8 mid-campaign produced no redistribution: walBalancer left the hot pchannel where it was because count was already even. Only manual `kubectl delete pod` on the hot pod forced reassignment, and the load just rotated to another single pod.
Workarounds attempted
- Increase pod count — does not help. Count-balance has nothing to redistribute.
- Increase per-pod memory — buys time, doesn't fix the asymmetry. Pushes OOM further out, not away.
- Tighten `walBalancer.triggerInterval` / `minRebalanceIntervalThreshold` / `vchannelFair.rebalanceTolerance` / `rebalanceMaxStep` / `antiAffinityWeight` — shortens the duration of a hot-spot but doesn't prevent re-concentration on the next pod, because the cost function still has nothing to weight against.
- `limitWriting.memProtection` — global write-deny when one pod hits ~85% mem. Not a balance fix; it just hard-denies cluster-wide writes when the asymmetry causes one pod to climb. Worse than the OOM. We disable it.
- `shards_num` per collection — would fix it (spreads writes across more pchannels so count-balance becomes effectively load-balance). Requires collection recreation; high operational cost.
- Manual operator intervention (`kubectl delete pod`) — current standing playbook. Forces reassignment but rotates the problem rather than solving it.
Proposed enhancement
Extend the `vchannelFair` cost function with an optional load-weight term:
    cost += LoadWeight * (loadDiff)^2
Where `loadDiff[node]` is the per-streamingnode deviation from the cluster mean of an existing Prometheus signal, e.g. `streamingnode_wal_append_bytes_rate` (5–60s rolling window).
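A rough Go sketch of the proposed term follows. The function name `loadCost` is hypothetical, and dividing each deviation by the cluster mean is an added assumption (it keeps the term unitless so it can be weighed against the count terms); the proposal itself only specifies a weighted squared deviation from the mean.

```go
package main

import "fmt"

// loadCost returns loadWeight * sum over nodes of (relative deviation from
// the cluster mean)^2. bytesRate holds one rolling-window append rate per
// streamingnode. A loadWeight of 0 disables the term entirely.
func loadCost(bytesRate []float64, loadWeight float64) float64 {
	if len(bytesRate) == 0 || loadWeight == 0 {
		return 0
	}
	var sum float64
	for _, r := range bytesRate {
		sum += r
	}
	mean := sum / float64(len(bytesRate))
	if mean == 0 {
		return 0 // idle cluster: nothing to balance
	}
	var cost float64
	for _, r := range bytesRate {
		d := (r - mean) / mean // normalization is an assumption, see lead-in
		cost += d * d
	}
	return loadWeight * cost
}

func main() {
	concentrated := []float64{100, 0, 0, 0} // all write load on one node
	spread := []float64{25, 25, 25, 25}     // same total load, evenly spread

	fmt.Println(loadCost(concentrated, 1.0)) // prints 12
	fmt.Println(loadCost(spread, 1.0))       // prints 0
	fmt.Println(loadCost(concentrated, 0.0)) // prints 0: term disabled
}
```

With a non-zero weight, the concentrated layout scores strictly worse than the spread one, which gives the balancer a gradient to follow that the count terms alone do not provide.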
Add corresponding config keys (defaults preserve current behaviour):
    streaming:
      walBalancer:
        balancePolicy:
          vchannelFair:
            loadWeight: 0.0        # default 0 = backward-compatible
            loadMetric: bytes_rate # bytes_rate | memory | (extensible)
            loadWindow: 30s
The cost-function structure already accepts weighted squared-diff terms; this is an additive extension rather than a redesign. Operators who don't set `loadWeight` get exactly today's behaviour.
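For illustration, the proposed keys could be validated along these lines. This is a sketch under assumptions: the `loadConfig` type and `parseLoadConfig` function are hypothetical and do not use Milvus's own parameter framework, and bounding `loadWindow` to 5–60s simply mirrors the rolling-window range suggested above.

```go
package main

import (
	"fmt"
	"time"
)

// loadConfig is a hypothetical holder for the three proposed keys.
type loadConfig struct {
	LoadWeight float64
	LoadMetric string
	LoadWindow time.Duration
}

// parseLoadConfig validates raw values as they might appear in the YAML.
func parseLoadConfig(weight float64, metric, window string) (loadConfig, error) {
	if weight < 0 {
		return loadConfig{}, fmt.Errorf("loadWeight must be >= 0, got %v", weight)
	}
	switch metric {
	case "bytes_rate", "memory":
		// the two metric names listed in the proposal
	default:
		return loadConfig{}, fmt.Errorf("unsupported loadMetric %q", metric)
	}
	d, err := time.ParseDuration(window)
	if err != nil {
		return loadConfig{}, fmt.Errorf("invalid loadWindow %q: %w", window, err)
	}
	// Assumed bounds, taken from the 5-60s window suggested in the proposal.
	if d < 5*time.Second || d > 60*time.Second {
		return loadConfig{}, fmt.Errorf("loadWindow %v outside 5s-60s", d)
	}
	return loadConfig{LoadWeight: weight, LoadMetric: metric, LoadWindow: d}, nil
}

func main() {
	// The proposed defaults: weight 0 keeps the load term switched off.
	cfg, err := parseLoadConfig(0.0, "bytes_rate", "30s")
	fmt.Printf("%+v %v\n", cfg, err)

	// An unknown metric name is rejected rather than silently ignored.
	_, err = parseLoadConfig(1.0, "cpu", "30s")
	fmt.Println(err)
}
```

Rejecting unknown metric names up front keeps the `(extensible)` list explicit: a new metric has to be added to the switch before operators can select it.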
Workflow impact if implemented
- Eliminates manual mid-campaign rebalancing. Today an operator watches streamingnode mem skew and does `kubectl delete pod` on the hot one every 15–30 minutes during peak load. With load-weighted balance, the policy would proactively reassign the hot pchannel before mem reaches the protection threshold.
- Restores `memProtection` as a viable safety net. Today it's disabled because asymmetric load makes it fire as a global write-deny rather than a per-pod safety bound. Memory-aware balance would keep all pods within the protection threshold under steady load, letting `memProtection` fire only on genuine cluster-wide overload.
- Streamingnode horizontal scaling becomes useful again. Today adding pods doesn't help — count-balance has nothing to redistribute. Load-weighted balance lets a freshly-scaled pod absorb hot pchannels.
- Reduces operator paging. Hot-pod-mem OOM is currently the dominant on-call signal during heavy ingest; a load-weighted policy prevents the asymmetric climb in the first place.
Backward compatibility
`loadWeight: 0.0` default preserves current behaviour exactly. Existing `vchannelFair` deployments that don't opt in see no change. The signal source (`streamingnode_wal_append_bytes_rate`) is already exported — no new metrics infrastructure needed.
Related issues
- #40638 — vchannels unevenly distributed (closed for 2.6.0; introduced `vchannelFair` but didn't add load awareness)
- #46026 — streamingnode memory leak under upsert workloads (compounds the asymmetry)
- #48564 — `sessionDiscoverer.initDiscover` retains stale streamingnode sessions, blocking the balancer (separate but adjacent)
- #47716 — streamingnode "freeze" / drain admin path inconsistent across components, deferred to 3.0 (so manual pchannel pin is not a viable interim lever)