vllm-project/vllm

[Docs] Document NIXL KV connector metrics aggregation semantics

Open

#41,230 opened on April 29, 2026

good first issue

Description

Summary

The NIXL KV connector logs transfer metrics periodically:

KV Transfer metrics: Num successful transfers=4, Avg xfer time (ms)=1.381, P90 xfer time (ms)=2.601, Avg post time (ms)=0.672, P90 post time (ms)=0.801, Avg MB per transfer=2.25, Throughput (MB/s)=1629.549, Avg number of descriptors=72.0

Currently there is no documentation explaining what these metrics represent, especially in the context of multi-rank (TP > 1) deployments. This has already caused confusion among users.

Current behavior

All metrics are aggregated across all TP ranks before summary stats are computed:

  1. Each TP rank independently records per-transfer telemetry (transfer_duration, post_duration, bytes_transferred, num_descriptors) via NixlKVConnectorStats.record_transfer() in stats.py.
  2. Stats from all ranks are concatenated via aggregate() (list.extend()).
  3. reduce() computes averages, percentiles, and throughput over the combined pool of observations from all ranks.
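
The three steps above can be sketched in plain Python. This is a simplified stand-in, not the code in stats.py: the class name PooledTransferStats, the units, the MB convention, and the nearest-rank P90 are assumptions made for illustration. Only the field names and the record_transfer / aggregate / reduce method names are taken from the description above.

```python
import math
from dataclasses import dataclass, field

FIELDS = ("transfer_duration", "post_duration", "bytes_transferred", "num_descriptors")


@dataclass
class PooledTransferStats:
    """Hypothetical stand-in for illustration; not the class in stats.py."""

    transfer_duration: list = field(default_factory=list)  # seconds (assumed unit)
    post_duration: list = field(default_factory=list)      # seconds (assumed unit)
    bytes_transferred: list = field(default_factory=list)
    num_descriptors: list = field(default_factory=list)

    def record_transfer(self, xfer_s, post_s, nbytes, ndesc):
        # Step 1: a rank appends one observation per completed transfer.
        self.transfer_duration.append(xfer_s)
        self.post_duration.append(post_s)
        self.bytes_transferred.append(nbytes)
        self.num_descriptors.append(ndesc)

    def aggregate(self, other):
        # Step 2: concatenate another rank's observations (list.extend()).
        for name in FIELDS:
            getattr(self, name).extend(getattr(other, name))
        return self

    def reduce(self):
        # Step 3: summary stats over the pooled observations of all ranks.
        n = len(self.transfer_duration)
        if n == 0:
            return {"num_transfers": 0}
        total_mb = sum(self.bytes_transferred) / 2**20  # assumption: 1 MB = 2**20 bytes
        total_s = sum(self.transfer_duration)           # summed durations, not wall-clock
        ranked = sorted(self.transfer_duration)
        p90 = ranked[math.ceil(0.9 * n) - 1]            # nearest-rank P90 (assumed method)
        return {
            "num_transfers": n,
            "avg_xfer_ms": 1e3 * total_s / n,
            "p90_xfer_ms": 1e3 * p90,
            "avg_mb_per_transfer": total_mb / n,
            "throughput_mb_s": total_mb / total_s,
        }


# Two ranks, two transfers each (made-up numbers).
rank0 = PooledTransferStats()
rank1 = PooledTransferStats()
rank0.record_transfer(0.001, 0.0005, 2 * 2**20, 72)
rank0.record_transfer(0.001, 0.0005, 2 * 2**20, 72)
rank1.record_transfer(0.002, 0.0005, 2 * 2**20, 72)
rank1.record_transfer(0.004, 0.0005, 2 * 2**20, 72)
summary = rank0.aggregate(rank1).reduce()
```

In this sketch summary["num_transfers"] is 4 even though each rank performed only two transfers, and the P90 is taken from the slower rank's durations because both ranks share one pool.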

This means:

  • "Num successful transfers" is the total count across all ranks, not per-rank.
  • "Avg MB per transfer" is the average over all individual rank-level transfers, not the total bytes moved for a single KV cache transfer operation.
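  • "Throughput (MB/s)" is total_MB_all_ranks / total_time_all_ranks, where total_time_all_ranks is the sum of the individual transfer durations rather than elapsed wall-clock time. It is effectively an average per-rank throughput, not the aggregate system throughput.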
  • Percentiles (P90) are computed over the combined distribution of all ranks' transfer times.
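
As a rough check, the throughput in the example log line can be reproduced from its other fields. The sketch below uses only values copied from that line. The "aggregate" figure additionally assumes that the four transfers are one per rank at TP=4 and overlap fully in time, which the log line itself does not tell us.

```python
# Values copied from the example log line above.
num_transfers = 4
avg_xfer_ms = 1.381
avg_mb = 2.25
logged_throughput = 1629.549

total_mb = num_transfers * avg_mb               # MB moved by all rank-level transfers
total_s = num_transfers * avg_xfer_ms / 1e3     # summed transfer durations

# What the log reports: total MB divided by summed transfer time.
pooled_throughput = total_mb / total_s

# Assumption: one transfer per rank, all four running concurrently,
# so the wall-clock time is roughly one average transfer duration.
wall_s = avg_xfer_ms / 1e3
aggregate_throughput = total_mb / wall_s

print(f"pooled:    {pooled_throughput:.1f} MB/s")
print(f"aggregate: {aggregate_throughput:.1f} MB/s")
```

The pooled value lands at about 1629 MB/s, matching the logged figure up to rounding of the averages, while the aggregate value under the stated assumption is four times larger. That gap is the misreading this issue asks the documentation to prevent.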

This is unintuitive because users may expect metrics to reflect per-engine totals or aggregate system throughput.

What needs to be documented

  1. Docstrings in stats.py: Add clear documentation to NixlKVConnectorStats explaining that stats are aggregated across all TP ranks and what each metric represents in that context.
  2. Inline comments in reduce(): Clarify the semantics of throughput and averages — that they are per-rank averages over the combined observation pool.
  3. Docstrings in metrics.py: Document the observe() → aggregate() → reduce() → log() pipeline and the fact that stats arrive pre-aggregated across workers.
  4. (Optional) Docs page: Add a section to the disaggregated serving documentation explaining how to interpret the KV Transfer metrics log line.

Relevant files

  • vllm/distributed/kv_transfer/kv_connector/v1/nixl/stats.py: NixlKVConnectorStats (recording, aggregation, reduction)
  • vllm/distributed/kv_transfer/kv_connector/v1/metrics.py: KVConnectorLogging (observe/log pipeline), KVConnectorStats (base class)

Context

See related discussion: metrics are aggregated across ranks rather than reported per-rank or per-engine. This is a deliberate design choice (fire-and-forget from workers), but it needs to be clearly documented so users can correctly interpret the numbers.
