vllm-project/vllm

[Docs] Document NIXL KV connector metrics aggregation semantics

Open

#41230 opened on Apr 29, 2026

good first issue

Description

Summary

The NIXL KV connector logs transfer metrics periodically:

KV Transfer metrics: Num successful transfers=4, Avg xfer time (ms)=1.381, P90 xfer time (ms)=2.601, Avg post time (ms)=0.672, P90 post time (ms)=0.801, Avg MB per transfer=2.25, Throughput (MB/s)=1629.549, Avg number of descriptors=72.0
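To work with these numbers programmatically, the line can be split into a dictionary. This is a minimal sketch that assumes the format is exactly the one shown above (comma-separated `name=value` pairs after a `KV Transfer metrics:` prefix). `parse_kv_transfer_metrics` is a hypothetical helper written for this issue, not part of vLLM.

```python
def parse_kv_transfer_metrics(line: str) -> dict[str, float]:
    """Parse a 'KV Transfer metrics: ...' log line into {name: value}.

    Hypothetical helper; assumes the format of the sample line above.
    """
    prefix = "KV Transfer metrics:"
    _, sep, payload = line.partition(prefix)
    if not sep:
        raise ValueError("not a KV Transfer metrics line")
    metrics = {}
    for item in payload.split(","):
        # Metric names contain spaces and parentheses, so split on the last '='.
        name, _, value = item.strip().rpartition("=")
        metrics[name] = float(value)
    return metrics


sample = (
    "KV Transfer metrics: Num successful transfers=4, "
    "Avg xfer time (ms)=1.381, P90 xfer time (ms)=2.601, "
    "Avg post time (ms)=0.672, P90 post time (ms)=0.801, "
    "Avg MB per transfer=2.25, Throughput (MB/s)=1629.549, "
    "Avg number of descriptors=72.0"
)
parsed = parse_kv_transfer_metrics(sample)
```

Splitting on the last `=` keeps names such as `Throughput (MB/s)` intact. The parser would need adjusting if the log format changes.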

Currently there is no documentation explaining what these metrics represent, especially in the context of multi-rank (TP > 1) deployments. This has already caused confusion among users.

Current behavior

All metrics are aggregated across all TP ranks before summary stats are computed:

  1. Each TP rank independently records per-transfer telemetry (transfer_duration, post_duration, bytes_transferred, num_descriptors) via NixlKVConnectorStats.record_transfer() in stats.py.
  2. Stats from all ranks are concatenated via aggregate() (list.extend()).
  3. reduce() computes averages, percentiles, and throughput over the combined pool of observations from all ranks.
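The three steps above can be sketched with a toy model. `ToyTransferStats` is a hypothetical stand-in, not the real `NixlKVConnectorStats`: it tracks only durations and byte counts, omits percentiles, and assumes that "MB" means 2**20 bytes.

```python
from dataclasses import dataclass, field

MIB = 2**20  # assumption: "MB" in the log line means 2**20 bytes


@dataclass
class ToyTransferStats:
    """Toy stand-in for per-rank transfer stats; NOT the vLLM class."""

    transfer_duration: list = field(default_factory=list)  # seconds
    bytes_transferred: list = field(default_factory=list)

    def record_transfer(self, duration_s: float, nbytes: int) -> None:
        # Step 1: each rank appends one observation per transfer.
        self.transfer_duration.append(duration_s)
        self.bytes_transferred.append(nbytes)

    def aggregate(self, other: "ToyTransferStats") -> "ToyTransferStats":
        # Step 2: concatenate observations; rank identity is lost here.
        self.transfer_duration.extend(other.transfer_duration)
        self.bytes_transferred.extend(other.bytes_transferred)
        return self

    def reduce(self) -> dict:
        # Step 3: summary stats over the combined pool of observations.
        n = len(self.transfer_duration)
        if n == 0:
            return {"num_transfers": 0}
        total_mb = sum(self.bytes_transferred) / MIB
        total_time_s = sum(self.transfer_duration)
        return {
            "num_transfers": n,
            "avg_xfer_time_ms": 1e3 * total_time_s / n,
            "avg_mb_per_transfer": total_mb / n,
            "throughput_mb_s": total_mb / total_time_s,
        }


# Two TP ranks, two transfers each.
rank0 = ToyTransferStats()
rank0.record_transfer(0.001, 2 * MIB)
rank0.record_transfer(0.003, 2 * MIB)
rank1 = ToyTransferStats()
rank1.record_transfer(0.002, 2 * MIB)
rank1.record_transfer(0.002, 2 * MIB)

summary = rank0.aggregate(rank1).reduce()
print(summary)
```

With these inputs the summary reports 4 transfers, even though each rank recorded only 2, and every average is taken over all 4 rank-level observations.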

This means:

  • "Num successful transfers" is the total count across all ranks, not per-rank.
  • "Avg MB per transfer" is the average over all individual rank-level transfers, not the total bytes moved for a single KV cache transfer operation.
  • "Throughput (MB/s)" is total_MB_all_ranks / total_time_all_ranks. Because the durations are summed across ranks, this is effectively an average per-rank throughput, not the aggregate system throughput.
  • Percentiles (P90) are computed over the combined distribution of all ranks' transfer times.

This is unintuitive because users may expect metrics to reflect per-engine totals or aggregate system throughput.
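A small worked example makes the gap concrete. It is a sketch under two assumptions that are not stated in the code: both TP ranks transfer concurrently (so wall-clock time is the longest single transfer), and "MB" means 2**20 bytes. The last lines recompute the throughput from the rounded averages in the sample log line above.

```python
MIB = 2**20  # assumption: "MB" in the log line means 2**20 bytes

# (duration_s, bytes) for each rank-level transfer: TP=2, one KV transfer.
observations = [(0.001, 2 * MIB), (0.001, 2 * MIB)]

total_mb = sum(nbytes for _, nbytes in observations) / MIB
summed_time_s = sum(duration for duration, _ in observations)

# What the log line reports: total MB over the SUM of per-rank durations.
pooled_mb_s = total_mb / summed_time_s

# Assumption: ranks run concurrently, so elapsed time is the max duration.
wall_clock_s = max(duration for duration, _ in observations)
aggregate_mb_s = total_mb / wall_clock_s

print(f"pooled (as logged):   {pooled_mb_s:.0f} MB/s")
print(f"aggregate wall-clock: {aggregate_mb_s:.0f} MB/s")

# Sanity check using the rounded averages from the sample log line:
# total MB = n * avg MB, total time = n * avg xfer time.
n, avg_xfer_ms, avg_mb = 4, 1.381, 2.25
implied_mb_s = (n * avg_mb) / (n * avg_xfer_ms / 1e3)
print(f"implied by sample log: {implied_mb_s:.1f} MB/s (logged: 1629.549)")
```

Under the concurrency assumption, the logged figure is half the aggregate figure for TP=2. The implied value matches the logged 1629.549 MB/s to within rounding of the averages, which is consistent with throughput being total MB divided by summed transfer time.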

What needs to be documented

  1. Docstrings in stats.py: Add clear documentation to NixlKVConnectorStats explaining that stats are aggregated across all TP ranks and what each metric represents in that context.
  2. Inline comments in reduce(): Clarify the semantics of throughput and averages: they are per-rank averages over the combined observation pool.
  3. Docstrings in metrics.py: Document the observe() → aggregate() → reduce() → log() pipeline and the fact that stats arrive pre-aggregated across workers.
  4. (Optional) Docs page: Add a section to the disaggregated serving documentation explaining how to interpret the KV Transfer metrics log line.

Relevant files

  • vllm/distributed/kv_transfer/kv_connector/v1/nixl/stats.py: NixlKVConnectorStats (recording, aggregation, reduction)
  • vllm/distributed/kv_transfer/kv_connector/v1/metrics.py: KVConnectorLogging (observe/log pipeline), KVConnectorStats (base class)

Context

See related discussion: metrics are aggregated across ranks rather than reported per-rank or per-engine. This is a deliberate design choice (fire-and-forget from workers), but it needs to be clearly documented so users can correctly interpret the numbers.
