vllm-project/vllm

[Docs] Document NIXL KV connector metrics aggregation semantics

Open

#41230 opened on Apr 29, 2026

good first issue

Description

Summary

The NIXL KV connector logs transfer metrics periodically:

KV Transfer metrics: Num successful transfers=4, Avg xfer time (ms)=1.381, P90 xfer time (ms)=2.601, Avg post time (ms)=0.672, P90 post time (ms)=0.801, Avg MB per transfer=2.25, Throughput (MB/s)=1629.549, Avg number of descriptors=72.0

Currently there is no documentation explaining what these metrics represent, especially in the context of multi-rank (TP > 1) deployments. This has already caused confusion among users.

Current behavior

All metrics are aggregated across all TP ranks before summary stats are computed:

  1. Each TP rank independently records per-transfer telemetry (transfer_duration, post_duration, bytes_transferred, num_descriptors) via NixlKVConnectorStats.record_transfer() in stats.py.
  2. Stats from all ranks are concatenated via aggregate() (list.extend()).
  3. reduce() computes averages, percentiles, and throughput over the combined pool of observations from all ranks.
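
The three steps above can be sketched roughly as follows. This is a minimal illustration, not the code in stats.py: the class name `SketchStats`, its field names, and the choice of 2**20 bytes per MB are assumptions made for the example, and percentiles, post_duration, and num_descriptors are left out for brevity.

```python
from dataclasses import dataclass, field


@dataclass
class SketchStats:
    """Hypothetical stand-in for a per-rank stats container (not vLLM's class)."""

    transfer_duration: list[float] = field(default_factory=list)  # seconds
    bytes_transferred: list[int] = field(default_factory=list)

    def record_transfer(self, duration_s: float, num_bytes: int) -> None:
        # Step 1: each rank appends one observation per completed transfer.
        self.transfer_duration.append(duration_s)
        self.bytes_transferred.append(num_bytes)

    def aggregate(self, other: "SketchStats") -> "SketchStats":
        # Step 2: observations from another rank are concatenated, not summed.
        self.transfer_duration.extend(other.transfer_duration)
        self.bytes_transferred.extend(other.bytes_transferred)
        return self

    def reduce(self) -> dict[str, float]:
        # Step 3: summary stats over the combined pool from all ranks.
        n = len(self.transfer_duration)
        if n == 0:
            return {"num_transfers": 0}
        total_mb = sum(self.bytes_transferred) / 2**20  # assumed MB definition
        total_s = sum(self.transfer_duration)
        return {
            "num_transfers": n,
            "avg_xfer_ms": 1e3 * total_s / n,
            "avg_mb_per_transfer": total_mb / n,
            "throughput_mb_s": total_mb / total_s,
        }


# Two ranks, each recording one 2 MB transfer that took 2 ms.
rank0 = SketchStats()
rank0.record_transfer(0.002, 2 * 2**20)
rank1 = SketchStats()
rank1.record_transfer(0.002, 2 * 2**20)

summary = rank0.aggregate(rank1).reduce()
print(summary)
```

In this sketch the summary reports two transfers of 2 MB each, even though both belong to what a user would likely think of as one 4 MB KV cache transfer.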

This means:

  • "Num successful transfers" is the total count across all ranks, not per-rank.
  • "Avg MB per transfer" is the average over all individual rank-level transfers, not the total bytes moved for a single KV cache transfer operation.
  • "Throughput (MB/s)" is total_MB_all_ranks / total_time_all_ranks — effectively an average per-rank throughput, not the aggregate system throughput.
  • Percentiles (P90) are computed over the combined distribution of all ranks' transfer times.

This is unintuitive because users may expect metrics to reflect per-engine totals or aggregate system throughput.
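
A small worked example may make the gap concrete. The numbers below are hypothetical (4 TP ranks, each moving 2.0 MB in 2.0 ms), and the wall-clock figure assumes the ranks transfer fully concurrently, which the issue does not state.

```python
# Hypothetical setup: 4 TP ranks, one transfer each, same size and duration.
num_ranks = 4
mb_per_rank = 2.0
seconds_per_rank = 0.002

durations = [seconds_per_rank] * num_ranks
megabytes = [mb_per_rank] * num_ranks

# Pooled figure as described above: total MB over the summed per-rank time.
pooled_throughput = sum(megabytes) / sum(durations)

# Wall-clock figure, assuming all ranks overlap perfectly in time.
wall_clock_throughput = sum(megabytes) / max(durations)

# "Avg MB per transfer" is per rank-level transfer, not per KV cache operation.
avg_mb_per_transfer = sum(megabytes) / len(megabytes)
total_mb_per_operation = sum(megabytes)

print(f"pooled throughput:     {pooled_throughput:.0f} MB/s")
print(f"wall-clock throughput: {wall_clock_throughput:.0f} MB/s")
print(f"avg MB per transfer:   {avg_mb_per_transfer:.2f}")
print(f"total MB per operation: {total_mb_per_operation:.2f}")
```

Under these assumptions the pooled throughput comes out near 1000 MB/s while the wall-clock throughput is near 4000 MB/s, a factor equal to the number of ranks. Real deployments will fall somewhere in between if transfers only partially overlap.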

What needs to be documented

  1. Docstrings in stats.py: Add clear documentation to NixlKVConnectorStats explaining that stats are aggregated across all TP ranks and what each metric represents in that context.
  2. Inline comments in reduce(): Clarify the semantics of throughput and averages — that they are per-rank averages over the combined observation pool.
  3. Docstrings in metrics.py: Document the observe() → aggregate() → reduce() → log() pipeline and the fact that stats arrive pre-aggregated across workers.
  4. (Optional) Docs page: Add a section to the disaggregated serving documentation explaining how to interpret the KV Transfer metrics log line.

Relevant files

  • vllm/distributed/kv_transfer/kv_connector/v1/nixl/stats.py: NixlKVConnectorStats (recording, aggregation, reduction)
  • vllm/distributed/kv_transfer/kv_connector/v1/metrics.py: KVConnectorLogging (observe/log pipeline), KVConnectorStats (base class)

Context

See related discussion: metrics are aggregated across ranks rather than reported per-rank or per-engine. This is a deliberate design choice (fire-and-forget from workers), but it needs to be clearly documented so users can correctly interpret the numbers.
