Repository metrics

Stars: 80,034
PR merge metrics: average time to merge 9d 2h; 921 PRs merged in the last 30d

Description

🚀 The feature, motivation and pitch

This RFC tracks the current state and planned improvements for Prefill-Decode (P/D) Disaggregation using the NixlConnector, which enables high-performance KV cache transfer between prefill and decode instances using the NIXL library.

Currently Supported Features

Core Infrastructure

NIXL Integration - Core P/D disaggregation framework (https://github.com/vllm-project/vllm/pull/17751)

Async KV Cache Transfers

Fully asynchronous KV cache transfers
- https://github.com/vllm-project/vllm/pull/33377 - Bugfix for async scheduling + request abort + async KV transfer
- https://github.com/vllm-project/vllm/pull/28327 - Simplify async KV output aggregation
- https://github.com/vllm-project/vllm/pull/27648 - Async scheduling support
- https://github.com/vllm-project/vllm/pull/31583 - Fix resuming preempted requests after async load

Multi-Transport Backend Support

Multi-transport backend support - UCX (default), LIBFABRIC, and other NIXL plugins
- ROCm support through RIXL library
- Support for OOT NIXL backends via kv_connector_extra_config (https://github.com/vllm-project/vllm/pull/33552)
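
As a rough illustration of how a backend choice could be passed through kv_connector_extra_config, the sketch below builds a JSON value of the kind accepted by a --kv-transfer-config style flag. The "backends" key and the role strings are assumptions based on common NixlConnector usage and should be checked against the vLLM version in use; the helper name is hypothetical.

```python
import json


def build_kv_transfer_config(role, backends):
    """Return a JSON string describing a NixlConnector KV transfer config.

    Assumption: the transport backends are listed under a "backends" key
    inside kv_connector_extra_config. Verify this against your vLLM version.
    """
    config = {
        "kv_connector": "NixlConnector",
        "kv_role": role,
        "kv_connector_extra_config": {"backends": list(backends)},
    }
    return json.dumps(config)


if __name__ == "__main__":
    # Example: a prefill-side instance that prefers UCX.
    print(build_kv_transfer_config("kv_producer", ["UCX"]))
```

Building the value programmatically keeps the nested quoting out of shell scripts, which is where this kind of flag tends to break.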

Tensor Parallelism

Homogeneous Tensor Parallelism - P and D instances with matching TP sizes (https://github.com/vllm-project/vllm/pull/17751)
Heterogeneous Tensor Parallelism - Support for different TP sizes between P and D
- https://github.com/vllm-project/vllm/pull/18833 - Base heterogeneous D TP > P TP support
- https://github.com/vllm-project/vllm/pull/20189 - Heterogeneous TP for FlashInfer
- https://github.com/vllm-project/vllm/pull/27274 - P TP > D TP support (including MLA use-case)
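
One way to picture the D TP > P TP case is that several decode ranks read from the same prefill rank, each taking its own slice of that rank's KV heads. The sketch below is illustrative only: the function name and the even-divisibility requirement are assumptions, not details taken from the PRs above.

```python
def remote_prefill_rank(decode_rank, decode_tp, prefill_tp):
    """Map a decode TP rank to (prefill rank, slice index within that rank).

    Assumes decode_tp is a multiple of prefill_tp, so each prefill rank
    serves exactly decode_tp // prefill_tp decode ranks.
    """
    if decode_tp % prefill_tp != 0:
        raise ValueError("decode_tp must be a multiple of prefill_tp")
    if not 0 <= decode_rank < decode_tp:
        raise ValueError("decode_rank out of range")
    tp_ratio = decode_tp // prefill_tp
    return decode_rank // tp_ratio, decode_rank % tp_ratio


if __name__ == "__main__":
    # With P TP = 2 and D TP = 4, decode ranks 0-1 read from prefill rank 0
    # and decode ranks 2-3 read from prefill rank 1.
    for rank in range(4):
        print(rank, remote_prefill_rank(rank, decode_tp=4, prefill_tp=2))
```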

MLA

Add support for MLA caches with different latent dim (Deepseek v3.2 Indexer)
- https://github.com/vllm-project/vllm/pull/25902

CPU Host Buffer Transfers

CPU host buffer transfers - Support for platforms without direct NIXL GPU-GPU transfer (D2H->H2D), for TPU, XPU and more.
- https://github.com/vllm-project/vllm/pull/18293 - Base CPU transfer support
- https://github.com/vllm-project/vllm/pull/24690 - CUDA to CPU memory transfers
- https://github.com/vllm-project/vllm/pull/28356 - Pure CPU environment support

Heterogeneous Configurations

The following also partially enable hybrid hardware deployment, among other use cases.

Support kernel_block_size != block_size (logical <> physical block_size mismatch)
- https://github.com/vllm-project/vllm/pull/30692
Heterogeneous block sizes - Different block sizes between P and D instances (cc @xuechendi )
- https://github.com/vllm-project/vllm/pull/26759 - Heterogeneous block_size support
- https://github.com/vllm-project/vllm/pull/30275 - Decoder-side post-processing for heterogeneous BlockSize
Heterogeneous KV layout (experimental) - HND to NHD permutation via enable_permute_local_kv
- https://github.com/vllm-project/vllm/pull/27743 - Cross-layer KV blocks support
- https://github.com/vllm-project/vllm/pull/30275 - Heterogeneous layout handling
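
To make the logical versus physical block size mismatch concrete, the sketch below expands one logical block ID into the kernel-sized blocks it would cover. It is a hypothetical helper that assumes block_size is an exact multiple of kernel_block_size and that the kernel blocks of a logical block are contiguous; the connector's actual bookkeeping may differ.

```python
def logical_to_kernel_blocks(block_id, block_size, kernel_block_size):
    """Return the kernel block IDs covered by one logical block.

    Assumes block_size % kernel_block_size == 0 and a contiguous layout.
    """
    if block_id < 0:
        raise ValueError("block_id must be non-negative")
    if block_size % kernel_block_size != 0:
        raise ValueError("block_size must be a multiple of kernel_block_size")
    ratio = block_size // kernel_block_size
    start = block_id * ratio
    return list(range(start, start + ratio))


if __name__ == "__main__":
    # A logical block of 64 tokens backed by kernel blocks of 16 tokens.
    print(logical_to_kernel_blocks(2, block_size=64, kernel_block_size=16))
```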

Reliability & Observability

Compatibility hash validation - Automatic P/D configuration compatibility checking
- https://github.com/vllm-project/vllm/pull/29503 - Compatibility checking in NIXL handshake
- https://github.com/vllm-project/vllm/pull/30182 - Debug logging level for compatibility hash
Transfer failure handling - Block invalidation and kv_load_failure_policy (fail/recompute)
- https://github.com/vllm-project/vllm/pull/32031 - Failure logging overhaul + early metadata free
- https://github.com/vllm-project/vllm/pull/28120 - Avoid NIXL_ERR_REMOTE_DISCONNECT on prefill failure
- https://github.com/vllm-project/vllm/pull/32198 - Document fail kv_load_failure_policy
- https://github.com/vllm-project/vllm/pull/29665 - Add remote_request_id for better tracking
NIXL telemetry and metrics - Transfer duration, throughput, failure counters (Prometheus)
- https://github.com/vllm-project/vllm/pull/25388 - Expose NIXL metrics for CLI logging
- https://github.com/vllm-project/vllm/pull/22188 - KVTransferMetrics aggregation strategy
- https://github.com/vllm-project/vllm/pull/32340 - Track nixl_num_kv_expired_reqs in Prometheus
- https://github.com/vllm-project/vllm/pull/28309 - KV events infrastructure
Request timeout/expiration - Automatic KV block release on P side via VLLM_NIXL_ABORT_REQUEST_TIMEOUT
- https://github.com/vllm-project/vllm/pull/20139 - Remote consumer READ timeout for clearing blocks
- https://github.com/vllm-project/vllm/pull/32340 - Expired requests metric tracking
Lease TTL renewals for improved freeing of kv blocks on P - https://github.com/vllm-project/vllm/pull/41383 (supersedes VLLM_NIXL_ABORT_REQUEST_TIMEOUT)
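
The timeout and lease items above share one idea: the P side holds KV blocks for a request only while something keeps vouching for them. The sketch below shows that idea in isolation with a hypothetical BlockLeases class and an injectable clock. It is not the implementation from the linked PRs, and the TTL value in the example is arbitrary.

```python
import time


class BlockLeases:
    """Track per-request leases on KV blocks held by the prefill side."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._expiry = {}  # request_id -> absolute expiry time

    def grant(self, request_id):
        """Start a lease that expires ttl_seconds from now."""
        self._expiry[request_id] = self._clock() + self._ttl

    def renew(self, request_id):
        """Extend a live lease; return False if unknown or already expired."""
        expiry = self._expiry.get(request_id)
        if expiry is None or expiry <= self._clock():
            return False
        self._expiry[request_id] = self._clock() + self._ttl
        return True

    def release(self, request_id):
        """Drop a lease early, for example after the decode side has read."""
        self._expiry.pop(request_id, None)

    def reap_expired(self):
        """Remove and return the request IDs whose leases have run out."""
        now = self._clock()
        expired = [rid for rid, t in self._expiry.items() if t <= now]
        for rid in expired:
            del self._expiry[rid]
        return expired


if __name__ == "__main__":
    leases = BlockLeases(ttl_seconds=30.0)
    leases.grant("req-1")
    print(leases.renew("req-1"), leases.reap_expired())
```

Compared with a single fixed timeout, renewal lets a slow but live consumer keep its blocks while an abandoned request is freed after one TTL.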

Deployment Configurations, Guides & Docs

Multi-instance deployments - Multiple P and D instances across hosts
- https://github.com/vllm-project/vllm/pull/28782 - Proxy server improvements for high concurrency
- https://github.com/vllm-project/vllm/pull/24249 - Clearer deployment examples
Data Parallel support - DP deployments with per-rank side channel ports
- https://github.com/vllm-project/vllm/pull/28782 - DP deployment documentation and proxy improvements
Bidirectional KV Transfer - https://github.com/vllm-project/vllm/pull/43097
KV Cache block Lease mechanism - https://github.com/vllm-project/vllm/pull/43099
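
For the Data Parallel item above, "per-rank side channel ports" can be pictured as each DP rank deriving its own handshake port from a shared base. The sketch below assumes a simple base-plus-rank offset. That scheme and the helper name are assumptions for illustration, so consult the deployment docs linked above for the actual rule.

```python
def side_channel_ports(base_port, dp_size):
    """Return a mapping of DP rank to side channel port.

    Assumption: ports are offset from base_port by the DP rank.
    """
    if dp_size < 1:
        raise ValueError("dp_size must be at least 1")
    if base_port < 1 or base_port + dp_size - 1 > 65535:
        raise ValueError("port range falls outside 1-65535")
    return {rank: base_port + rank for rank in range(dp_size)}


if __name__ == "__main__":
    # Example base port chosen arbitrarily for illustration.
    print(side_channel_ports(5600, 4))
```

A proxy or router that needs to reach a particular DP rank can use a mapping like this to find the right handshake endpoint.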

Spec Decoding

Speculative decoding integration - P/D disaggregation with speculative decoding, plus CI (@ZhanqiuHu)
- Fix for layer-by-layer transfer (not NIXL) https://github.com/vllm-project/vllm/pull/35158
- https://github.com/vllm-project/vllm/pull/35760
HMA-enabled hybrid (SW) SD integration - for example Kimi K2.5 Eagle3; see https://vllm-dev.slack.com/archives/C07RFT2UT16/p1773434620383319
- Follows up from https://github.com/vllm-project/vllm/pull/35758
Hybrid SSM + SD
- Qwen3.5+MTP - https://github.com/vllm-project/vllm/pull/42677
Clarify Spec Decoding compatibility matrix

SSM

NIXL+Hybrid Memory Allocator - https://github.com/vllm-project/vllm/pull/35758
SSM (Mamba) support - https://github.com/vllm-project/vllm/pull/34727
TP>1 async scheduling issue - https://github.com/vllm-project/vllm/issues/37285
Fix Mamba P/D workflow to avoid recomputing last token twice (both on P and D) - https://github.com/vllm-project/vllm/pull/37310
Heterogeneous TP support for hybrid SSM models - https://github.com/vllm-project/vllm/pull/37635/

Work in Progress

Planned

Request failure handling for NIXL + HMA
Fix SpecDecoding asymmetric num_speculative_tokens UX - https://github.com/vllm-project/vllm/pull/43733#pullrequestreview-4370641549 (will likely require roles to be defined at config time)
Documentation improvements - Clarify PD feature matrix in docs with examples
Multi-backend model support - Models with multiple attention backends (mostly validation of HMA feature coverage)
Hybrid hardware deployment - Supported to the extent tested by @xuechendi and team. Another AMD-Nvidia use case is reported at https://uccl-project.github.io/posts/uccl-ep-full/. This is untested in CI, and we should clarify capabilities and limitations.
Mamba1 support
FP8 kv cache support (attention-dependent for now, depending on how scales are stored) - Issue requesting support https://github.com/vllm-project/vllm/issues/42179
nvfp4 kv cache support

Backlog

HTTP-based handshake endpoint - Replace ZMQ side channel with HTTP for better observability
Transfer Failure handling for HMA
More efficient h2d copy_blocks operations for HMA groups
Heterogeneous block size (block_size_ratio > 1) HMA support

RFCs

Bi-directional KV transfers with Nixl connector - https://github.com/vllm-project/vllm/issues/32733
Remove Per-Block KV Transfer Error Handling - https://github.com/vllm-project/vllm/issues/35780
kv_both role deprecation - https://github.com/vllm-project/vllm/issues/43807

Known Issues/Bugs:

Gemma3 heterogeneous TP accuracy drop - https://github.com/vllm-project/vllm/issues/37333
TP-dependent block sizes break HMA heterogeneous TP - https://github.com/vllm-project/vllm/issues/41037
Bidirectional KV transfer produces incorrect results when reasoning traces are stripped between turns - https://github.com/vllm-project/vllm/issues/43094
PD+SD Prefix-cache trimming drops wrong block when P has extra lookahead block - https://github.com/vllm-project/vllm/issues/43996

Bug Fixes

Fix NIXL handshake failures not honoring kv_load_failure_policy - https://github.com/vllm-project/vllm/pull/33745
Fix multi-node TP (TP>8) - https://github.com/vllm-project/vllm/pull/39907
Fix CG dispatch for mamba PD to FULL_DECODE - https://github.com/vllm-project/vllm/pull/42430

Related Projects

Encoder-Prefill-Decode Disaggregation: https://github.com/vllm-project/vllm/pull/25233
Mooncake Transfer Engine: https://github.com/vllm-project/vllm/pull/24718, https://github.com/vllm-project/vllm/pull/31573

cc @robertgshaw2-redhat @tlrmchlsmth @markmc @njhill @orozery

Alternatives

No response

Additional context

No response

Before submitting a new issue...

Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.

Contributor guide

Research direction: Review the roadmap and identify an open, unassigned task, such as 'FP8 kv cache support' or 'Documentation improvements'. Check the linked issues for simpler subtasks.
Tech stack: Python
Domain: backend, infrastructure
Issue type: Research
Difficulty: 4
Estimated time: More than 1 week
Activity status: Active
Clarity: Clear
Prerequisites: Python, Git
Beginner friendliness: 10
