vllm-project/vllm
[Docs] Document NIXL KV connector metrics aggregation semantics
Open
#41230 opened on Apr 29, 2026
good first issue
Description
Summary
The NIXL KV connector logs transfer metrics periodically:
KV Transfer metrics: Num successful transfers=4, Avg xfer time (ms)=1.381, P90 xfer time (ms)=2.601, Avg post time (ms)=0.672, P90 post time (ms)=0.801, Avg MB per transfer=2.25, Throughput (MB/s)=1629.549, Avg number of descriptors=72.0
Currently there is no documentation explaining what these metrics represent, especially in the context of multi-rank (TP > 1) deployments. This has already caused confusion among users.
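For anyone reading these numbers programmatically, the log line above can be split into named values. This is a minimal sketch that assumes the line keeps the exact `name=value, name=value` layout shown in this issue; `parse_kv_transfer_metrics` is a hypothetical helper and not part of vLLM.

```python
LOG_LINE = (
    "KV Transfer metrics: Num successful transfers=4, "
    "Avg xfer time (ms)=1.381, P90 xfer time (ms)=2.601, "
    "Avg post time (ms)=0.672, P90 post time (ms)=0.801, "
    "Avg MB per transfer=2.25, Throughput (MB/s)=1629.549, "
    "Avg number of descriptors=72.0"
)


def parse_kv_transfer_metrics(line: str) -> dict[str, float]:
    """Hypothetical helper: map each metric name in the log line to a float."""
    # Everything after the fixed prefix is a comma-separated list of fields.
    _, found, body = line.partition("KV Transfer metrics: ")
    if not found or not body:
        raise ValueError("not a KV Transfer metrics line")
    metrics: dict[str, float] = {}
    for field in body.split(", "):
        # Metric names contain spaces and parentheses, so split on the last "=".
        name, _, value = field.rpartition("=")
        metrics[name] = float(value)
    return metrics


metrics = parse_kv_transfer_metrics(LOG_LINE)

# Consistency check on the sample line: average size divided by average
# transfer time (converted from ms to s) should land close to the logged
# throughput, up to rounding of the printed averages.
derived_mb_s = metrics["Avg MB per transfer"] / (metrics["Avg xfer time (ms)"] / 1000)
```

On the sample line, `derived_mb_s` comes out within about 1 MB/s of the logged `Throughput (MB/s)`, which fits the total-MB over total-time reading described under "Current behavior".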
Current behavior
All metrics are aggregated across all TP ranks before summary stats are computed:
- Each TP rank independently records per-transfer telemetry (`transfer_duration`, `post_duration`, `bytes_transferred`, `num_descriptors`) via `NixlKVConnectorStats.record_transfer()` in `stats.py`.
- Stats from all ranks are concatenated via `aggregate()` (`list.extend()`).
- `reduce()` computes averages, percentiles, and throughput over the combined pool of observations from all ranks.
This means:
- "Num successful transfers" is the total count across all ranks, not per-rank.
- "Avg MB per transfer" is the average over all individual rank-level transfers, not the total bytes moved for a single KV cache transfer operation.
- "Throughput (MB/s)" is `total_MB_all_ranks / total_time_all_ranks`, which is effectively an average per-rank throughput, not the aggregate system throughput.
- Percentiles (P90) are computed over the combined distribution of all ranks' transfer times.
This is unintuitive because users may expect metrics to reflect per-engine totals or aggregate system throughput.
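The pooling behavior described in this section can be illustrated with a small stand-alone model. This is a sketch, not the code in `stats.py`: `RankStats` is a hypothetical class, only two of the four telemetry fields are modeled, and 1 MB is assumed to be 2**20 bytes.

```python
import math
from dataclasses import dataclass, field


@dataclass
class RankStats:
    """Hypothetical stand-in for the per-transfer telemetry of one TP rank."""

    transfer_duration: list[float] = field(default_factory=list)  # seconds
    bytes_transferred: list[int] = field(default_factory=list)

    def record_transfer(self, seconds: float, num_bytes: int) -> None:
        self.transfer_duration.append(seconds)
        self.bytes_transferred.append(num_bytes)

    def aggregate(self, other: "RankStats") -> "RankStats":
        # Pooling step: observations from another rank are appended, so the
        # rank that produced each observation is no longer distinguishable.
        self.transfer_duration.extend(other.transfer_duration)
        self.bytes_transferred.extend(other.bytes_transferred)
        return self

    def reduce(self) -> dict[str, float]:
        n = len(self.transfer_duration)
        total_mb = sum(self.bytes_transferred) / 2**20  # assumption: MB = 2**20 bytes
        total_time = sum(self.transfer_duration)  # summed over ranks, not wall clock
        return {
            "num_transfers": n,
            "avg_mb_per_transfer": total_mb / n,
            "throughput_mb_s": total_mb / total_time,
        }


# TP=2: one KV cache transfer operation in which each rank moves 2 MB in 1 ms.
rank0, rank1 = RankStats(), RankStats()
rank0.record_transfer(0.001, 2 * 2**20)
rank1.record_transfer(0.001, 2 * 2**20)
pooled = rank0.aggregate(rank1).reduce()
```

In this example `pooled` reports 2 transfers, 2 MB per transfer, and roughly 2000 MB/s. If the two ranks ran concurrently (an assumption of the example), the operation moved 4 MB in about 1 ms of wall-clock time, roughly 4000 MB/s, which is the gap between the logged figure and aggregate system throughput.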
What needs to be documented
- Docstrings in `stats.py`: Add clear documentation to `NixlKVConnectorStats` explaining that stats are aggregated across all TP ranks and what each metric represents in that context.
- Inline comments in `reduce()`: Clarify the semantics of throughput and averages, namely that they are per-rank averages over the combined observation pool.
- Docstrings in `metrics.py`: Document the `observe()` → `aggregate()` → `reduce()` → `log()` pipeline and the fact that stats arrive pre-aggregated across workers.
- (Optional) Docs page: Add a section to the disaggregated serving documentation explaining how to interpret the KV Transfer metrics log line.
Relevant files
- `vllm/distributed/kv_transfer/kv_connector/v1/nixl/stats.py`: `NixlKVConnectorStats` (recording, aggregation, reduction)
- `vllm/distributed/kv_transfer/kv_connector/v1/metrics.py`: `KVConnectorLogging` (observe/log pipeline), `KVConnectorStats` (base class)
Context
See related discussion: metrics are aggregated across ranks rather than reported per-rank or per-engine. This is a deliberate design choice (fire-and-forget from workers), but it needs to be clearly documented so users can correctly interpret the numbers.