vllm-project/vllm
[Roadmap]: PD Disaggregation with `NixlConnector` Roadmap
Open
#33702 opened on Feb 3, 2026
feature request, help wanted
Description
This RFC tracks the current state and planned improvements for Prefill-Decode (P/D) Disaggregation using the NixlConnector, which enables high-performance KV cache transfer between prefill and decode instances using the NIXL library.
Currently Supported Features
Core Infrastructure
- NIXL Integration - Core P/D disaggregation framework (https://github.com/vllm-project/vllm/pull/17751)
Async KV Cache Transfers
- Fully asynchronous KV cache transfers
- https://github.com/vllm-project/vllm/pull/33377 - Bugfix for async scheduling + request abort + async KV transfer
- https://github.com/vllm-project/vllm/pull/28327 - Simplify async KV output aggregation
- https://github.com/vllm-project/vllm/pull/27648 - Async scheduling support
- https://github.com/vllm-project/vllm/pull/31583 - Fix resuming preempted requests after async load
Multi-Transport Backend Support
- Multi-transport backend support - UCX (default), LIBFABRIC, and other NIXL plugins
- ROCm support through RIXL library
- Support for OOT NIXL backends via kv_connector_extra_config (https://github.com/vllm-project/vllm/pull/33552)
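The transport backend is chosen through the connector's KV transfer configuration. A minimal sketch of building that configuration as JSON follows. It assumes that NIXL plugin names are passed under a `backends` key inside `kv_connector_extra_config` and that `kv_both` is an accepted role; the helper name `build_kv_transfer_config` is hypothetical, and the exact keys should be checked against the installed vLLM version.

```python
import json


def build_kv_transfer_config(backends):
    """Return a JSON string describing a NixlConnector KV transfer config.

    Hypothetical helper: the key names follow the assumptions stated above.
    """
    config = {
        "kv_connector": "NixlConnector",
        "kv_role": "kv_both",
        # Assumption: NIXL plugin names are listed under "backends".
        "kv_connector_extra_config": {"backends": list(backends)},
    }
    return json.dumps(config)


if __name__ == "__main__":
    # UCX is the default transport according to the list above.
    print(build_kv_transfer_config(["UCX"]))
```

The resulting string is the kind of value that would be passed to `--kv-transfer-config` when launching a P or D instance.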
Tensor Parallelism
- Homogeneous Tensor Parallelism - P and D instances with matching TP sizes (https://github.com/vllm-project/vllm/pull/17751)
- Heterogeneous Tensor Parallelism - Support for different TP sizes between P and D
- https://github.com/vllm-project/vllm/pull/18833 - Base heterogeneous D TP > P TP support
- https://github.com/vllm-project/vllm/pull/20189 - Heterogeneous TP for FlashInfer
- https://github.com/vllm-project/vllm/pull/27274 - P TP > D TP support (including MLA use-case)
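To illustrate what heterogeneous TP implies for the D TP > P TP case, the sketch below computes which prefill rank a decode rank would read from and which slice of that rank's KV heads it needs. This is a simplified model, not the connector's actual logic: it assumes KV heads are sharded contiguously and evenly across ranks, ignores KV head replication and MLA, and all names (`RemoteSlice`, `map_decode_rank`) are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RemoteSlice:
    p_rank: int      # prefill TP rank to read from
    head_start: int  # first KV head to read, local to that prefill rank
    head_count: int  # number of KV heads to read


def map_decode_rank(d_rank: int, p_tp: int, d_tp: int, num_kv_heads: int) -> RemoteSlice:
    """Map a decode rank to a prefill rank and head slice (D TP > P TP only)."""
    if d_tp % p_tp != 0:
        raise ValueError("d_tp must be a multiple of p_tp")
    if num_kv_heads % d_tp != 0:
        raise ValueError("num_kv_heads must be divisible by d_tp")
    if not 0 <= d_rank < d_tp:
        raise ValueError("d_rank out of range")

    tp_ratio = d_tp // p_tp              # decode ranks served by one prefill rank
    heads_per_d = num_kv_heads // d_tp   # KV heads held by each decode rank
    return RemoteSlice(
        p_rank=d_rank // tp_ratio,
        head_start=(d_rank % tp_ratio) * heads_per_d,
        head_count=heads_per_d,
    )


if __name__ == "__main__":
    # 8 KV heads, prefill TP=2, decode TP=4.
    for rank in range(4):
        print(rank, map_decode_rank(rank, p_tp=2, d_tp=4, num_kv_heads=8))
```

With 8 KV heads, P TP=2 and D TP=4, each prefill rank holds 4 heads and serves two decode ranks, each of which reads 2 of them.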
MLA
- Add support for MLA caches with different latent dim (Deepseek v3.2 Indexer)
CPU Host Buffer Transfers
- CPU host buffer transfers - Support for platforms without direct NIXL GPU-GPU transfer (D2H->H2D), for TPU, XPU and more.
- https://github.com/vllm-project/vllm/pull/18293 - Base CPU transfer support
- https://github.com/vllm-project/vllm/pull/24690 - CUDA to CPU memory transfers
- https://github.com/vllm-project/vllm/pull/28356 - Pure CPU environment support
Heterogeneous Configurations
The following also partially enable hybrid hardware deployment, among other use cases.
- Support kernel_block_size != block_size (logical <> physical block_size mismatch)
- Heterogeneous block sizes - Different block sizes between P and D instances (cc @xuechendi )
- https://github.com/vllm-project/vllm/pull/26759 - Heterogeneous block_size support
- https://github.com/vllm-project/vllm/pull/30275 - Decoder-side post-processing for heterogeneous BlockSize
- Heterogeneous KV layout (experimental) - HND to NHD permutation via enable_permute_local_kv
- https://github.com/vllm-project/vllm/pull/27743 - Cross-layer KV blocks support
- https://github.com/vllm-project/vllm/pull/30275 - Heterogeneous layout handling
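One way to picture heterogeneous block sizes is to view each block on the side with the larger block size as several sub-blocks of the smaller size, so that transfers line up one-to-one. The sketch below shows that index arithmetic only. It is an illustration under assumptions, not the connector's implementation: the function names are hypothetical, the larger block size is assumed to be an integer multiple of the smaller one, and the last large block may be only partially filled.

```python
def split_into_sub_blocks(block_ids, ratio):
    """Expand each large block id into `ratio` consecutive sub-block ids."""
    if ratio < 1:
        raise ValueError("ratio must be >= 1")
    return [bid * ratio + i for bid in block_ids for i in range(ratio)]


def pair_blocks(local_large_ids, remote_small_ids, ratio):
    """Pair local sub-blocks with remote small blocks, in request order.

    The remote side may send fewer blocks than the local capacity when the
    final large block is only partially filled.
    """
    local_sub = split_into_sub_blocks(local_large_ids, ratio)
    if len(remote_small_ids) > len(local_sub):
        raise ValueError("not enough local capacity for remote blocks")
    return list(zip(local_sub[: len(remote_small_ids)], remote_small_ids))


if __name__ == "__main__":
    # Example: local block size 64, remote block size 16, so ratio = 4.
    print(split_into_sub_blocks([2, 5], ratio=4))
    print(pair_blocks([2], [7, 9, 4], ratio=4))
```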
Reliability & Observability
- Compatibility hash validation - Automatic P/D configuration compatibility checking
- https://github.com/vllm-project/vllm/pull/29503 - Compatibility checking in NIXL handshake
- https://github.com/vllm-project/vllm/pull/30182 - Debug logging level for compatibility hash
- Transfer failure handling - Block invalidation and kv_load_failure_policy (fail/recompute)
- https://github.com/vllm-project/vllm/pull/32031 - Failure logging overhaul + early metadata free
- https://github.com/vllm-project/vllm/pull/28120 - Avoid NIXL_ERR_REMOTE_DISCONNECT on prefill failure
- https://github.com/vllm-project/vllm/pull/32198 - Document fail kv_load_failure_policy
- https://github.com/vllm-project/vllm/pull/29665 - Add remote_request_id for better tracking
- NIXL telemetry and metrics - Transfer duration, throughput, failure counters (Prometheus)
- https://github.com/vllm-project/vllm/pull/25388 - Expose NIXL metrics for CLI logging
- https://github.com/vllm-project/vllm/pull/22188 - KVTransferMetrics aggregation strategy
- https://github.com/vllm-project/vllm/pull/32340 - Track nixl_num_kv_expired_reqs in Prometheus
- https://github.com/vllm-project/vllm/pull/28309 - KV events infrastructure
- Request timeout/expiration - Automatic KV block release on P side via VLLM_NIXL_ABORT_REQUEST_TIMEOUT
- https://github.com/vllm-project/vllm/pull/20139 - Remote consumer READ timeout for clearing blocks
- https://github.com/vllm-project/vllm/pull/32340 - Expired requests metric tracking
- Lease TTL renewals for improved freeing of KV blocks on P - https://github.com/vllm-project/vllm/pull/41383 (supersedes VLLM_NIXL_ABORT_REQUEST_TIMEOUT)
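The timeout and lease items above share one idea: the P side holds KV blocks for a request only for a bounded time unless the hold is renewed. A minimal sketch of that bookkeeping is shown below. It is a generic lease table, not the code from the linked PRs; the class name `LeaseTable`, its methods, and the injectable clock are all hypothetical.

```python
import time


class LeaseTable:
    """Track per-request leases that expire unless renewed."""

    def __init__(self, ttl_s, clock=time.monotonic):
        self._ttl_s = ttl_s
        self._clock = clock      # injectable so expiry can be tested
        self._expiry = {}        # request_id -> absolute expiry time

    def grant(self, request_id):
        """Start holding blocks for a request."""
        self._expiry[request_id] = self._clock() + self._ttl_s

    def renew(self, request_id):
        """Extend a live lease; return False if unknown or already expired."""
        now = self._clock()
        deadline = self._expiry.get(request_id)
        if deadline is None or deadline <= now:
            return False
        self._expiry[request_id] = now + self._ttl_s
        return True

    def release(self, request_id):
        """Drop a lease once the transfer has completed."""
        self._expiry.pop(request_id, None)

    def reap_expired(self):
        """Remove and return request ids whose blocks can now be freed."""
        now = self._clock()
        expired = [rid for rid, deadline in self._expiry.items() if deadline <= now]
        for rid in expired:
            del self._expiry[rid]
        return expired


if __name__ == "__main__":
    table = LeaseTable(ttl_s=0.05)
    table.grant("req-1")
    time.sleep(0.1)
    print(table.reap_expired())
```

Compared with a single fixed timeout, renewals let a slow but healthy D instance keep its blocks alive while abandoned requests are still reclaimed after one TTL.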
Deployment Configurations, Guides & Docs
- Multi-instance deployments - Multiple P and D instances across hosts
- https://github.com/vllm-project/vllm/pull/28782 - Proxy server improvements for high concurrency
- https://github.com/vllm-project/vllm/pull/24249 - Clearer deployment examples
- Data Parallel support - DP deployments with per-rank side channel ports
- https://github.com/vllm-project/vllm/pull/28782 - DP deployment documentation and proxy improvements
- Bidirectional KV Transfer - https://github.com/vllm-project/vllm/pull/43097
- KV Cache block Lease mechanism - https://github.com/vllm-project/vllm/pull/43099
Spec Decoding
- Speculative decoding integration - P/D disaggregation with speculative decoding +CI @ZhanqiuHu
- Fix for layer-by-layer transfer (not NIXL) https://github.com/vllm-project/vllm/pull/35158
- https://github.com/vllm-project/vllm/pull/35760
- HMA enabled Hybrid (SW) SD integration - example Kimi K2.5 Eagle3 see https://vllm-dev.slack.com/archives/C07RFT2UT16/p1773434620383319
- Follows up from https://github.com/vllm-project/vllm/pull/35758
- Hybrid SSM + SD
- Qwen3.5+MTP - https://github.com/vllm-project/vllm/pull/42677
- Clarify Spec Decoding compatibility matrix
SSM
- NIXL+Hybrid Memory Allocator - https://github.com/vllm-project/vllm/pull/35758
- SSM (Mamba) support - https://github.com/vllm-project/vllm/pull/34727
- TP>1 async scheduling issue - https://github.com/vllm-project/vllm/issues/37285
- Fix Mamba P/D workflow to avoid recomputing last token twice (both on P and D) - https://github.com/vllm-project/vllm/pull/37310
- Heterogeneous TP support for hybrid SSM models - https://github.com/vllm-project/vllm/pull/37635/
Work in Progress
- Enhanced error diagnostics - Structured logging with failure context for easier debugging
- Enable drain scaledown mode for single process deployments - https://github.com/vllm-project/vllm/pull/32420
- Speculative decoding integration - P/D disaggregation with speculative decoding +CI @ZhanqiuHu
- CPU Offloading verification and CI tests - https://github.com/vllm-project/vllm/pull/39200
- PD+CPU Offloading+Spec Decoding verification and CI tests
- DecodeContextParallel support - https://github.com/vllm-project/vllm/pull/38433
- Bi-directional KV transfers with Nixl connector - https://github.com/vllm-project/vllm/pull/32553
- GDN model support (Qwen3.5+) - https://github.com/vllm-project/vllm/pull/41869
- Mamba prefix caching mode align/all - https://github.com/vllm-project/vllm/pull/42554
- NixlConnector Refactor:
Planned
- Nixl + HMA support request failure handling
- Documentation improvements - Clarify PD feature matrix in docs with examples
- Pipeline parallelism support - P/D disaggregation with pipeline parallelism
- Multi-backend model support - Models with multiple attention backends (mostly validation of HMA feature coverage)
- Hybrid hardware deployment - Supported to the extent tested by @xuechendi and team. Another AMD-Nvidia use case is reported at https://uccl-project.github.io/posts/uccl-ep-full/. This is untested in CI, and we should clarify capabilities and limitations.
- Mamba1 support
- FP8 support (attention-dependent for now, depending on how scales are stored) - Issue requesting support https://github.com/vllm-project/vllm/issues/42179
Backlog
- HTTP-based handshake endpoint - Replace ZMQ side channel with HTTP for better observability
- Transfer Failure handling for HMA
- More efficient h2d copy_blocks operations for HMA groups
- Heterogeneous block size (block_size_ratio > 1) HMA support
RFCs
- Bi-directional KV transfers with Nixl connector - https://github.com/vllm-project/vllm/issues/32733
- Remove Per-Block KV Transfer Error Handling - https://github.com/vllm-project/vllm/issues/35780
Known Issues/Bugs:
- Gemma3 heterogeneous TP accuracy drop - https://github.com/vllm-project/vllm/issues/37333
- TP-dependent block sizes break HMA heterogeneous TP - https://github.com/vllm-project/vllm/issues/41037
- Bidirectional KV transfer produces incorrect results when reasoning traces are stripped between turns - https://github.com/vllm-project/vllm/issues/43094
Bug Fixes
- Fix NIXL handshake failures not honoring kv_load_failure_policy - https://github.com/vllm-project/vllm/pull/33745
- Fix multi-node TP (TP>8) - https://github.com/vllm-project/vllm/pull/39907
- Fix CG dispatch for mamba PD to FULL_DECODE - https://github.com/vllm-project/vllm/pull/42430
Related Projects
- Encoder-Prefill-Decode Disaggregation: https://github.com/vllm-project/vllm/pull/25233
- Mooncake Transfer Engine: https://github.com/vllm-project/vllm/pull/24718, https://github.com/vllm-project/vllm/pull/31573
cc @robertgshaw2-redhat @tlrmchlsmth @markmc @njhill @orozery