vllm-project/vllm

[Roadmap]: PD Disaggregation with `NixlConnector` Roadmap

Open

#33702 opened on Feb 3, 2026

feature request, help wanted

Description

🚀 The feature, motivation and pitch

This RFC tracks the current state and planned improvements for Prefill-Decode (P/D) Disaggregation using the NixlConnector, which enables high-performance KV cache transfer between prefill and decode instances using the NIXL library.
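For orientation, here is a minimal sketch of how the connector is typically selected on each instance. It assumes the `--kv-transfer-config` flag of `vllm serve` and the `kv_connector` / `kv_role` JSON fields as they appear in vLLM's disaggregated-serving examples; verify them against your vLLM version. The helper `build_serve_command`, the model name, and the ports are hypothetical placeholders.

```python
import json
import shlex


def build_serve_command(model: str, port: int, kv_role: str = "kv_both") -> str:
    """Return a `vllm serve` command line that enables NixlConnector.

    Assumption: flag and JSON field names follow vLLM's disaggregated-serving
    examples. This helper only builds a string; it does not start a server.
    """
    kv_transfer_config = {"kv_connector": "NixlConnector", "kv_role": kv_role}
    args = [
        "vllm", "serve", model,
        "--port", str(port),
        "--kv-transfer-config", json.dumps(kv_transfer_config),
    ]
    # shlex.join quotes the JSON payload so it survives shell parsing.
    return shlex.join(args)


# Placeholder model name and ports: one prefill and one decode instance.
prefill_cmd = build_serve_command("my-org/my-model", port=8100)
decode_cmd = build_serve_command("my-org/my-model", port=8200)
print(prefill_cmd)
print(decode_cmd)
```

Both instances use the same connector settings in this sketch. Routing each request to a prefill instance first and then to a decode instance is left to a separate proxy, which is not shown here.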

Currently Supported Features

Core Infrastructure

Async KV Cache Transfers

Multi-Transport Backend Support

Tensor Parallelism

MLA

CPU Host Buffer Transfers

Heterogeneous Configurations

The following also partially enable hybrid hardware deployment, among other use cases.

Reliability & Observability

Deployment Configurations

Guides & Docs

Spec Decoding

SSM

Work in Progress

Planned

  • Nixl + HMA support request failure handling
  • Fix SpecDecoding asymmetric num_speculative_tokens UX - https://github.com/vllm-project/vllm/pull/43733#pullrequestreview-4370641549 (will likely require roles to be defined at config time)
  • Documentation improvements - Clarify PD feature matrix in docs with examples
  • Multi-backend model support - Models with multiple attention backends (mostly validation of HMA feature coverage)
  • Hybrid hardware deployment - Supported to the extent tested by @xuechendi and team. Another AMD-NVIDIA use case is reported at https://uccl-project.github.io/posts/uccl-ep-full/. This is untested in CI, and we should clarify capabilities and limitations.
  • Mamba1 support
  • FP8 kv cache support (attention-dependent for now, depending on how scales are stored) - Issue requesting support https://github.com/vllm-project/vllm/issues/42179
  • nvfp4 kv cache support

Backlog

  • HTTP-based handshake endpoint - Replace ZMQ side channel with HTTP for better observability
  • Transfer Failure handling for HMA
  • More efficient h2d copy_blocks operations for HMA groups
  • Heterogeneous block size (block_size_ratio > 1) HMA support
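
The heterogeneous block size item can be illustrated with simple index arithmetic. The sketch below assumes that the ratio is the larger block size divided by the smaller one, and that one large block corresponds to that many consecutive small blocks. The function name and the contiguous layout are illustrative assumptions, not the connector's actual implementation.

```python
def expand_block_ids(large_block_ids: list[int], block_size_ratio: int) -> list[int]:
    """Map block IDs on the large-block side to IDs on the small-block side.

    Assumption (hypothetical layout): large block `b` covers the small blocks
    b * ratio, b * ratio + 1, ..., b * ratio + (ratio - 1).
    """
    if block_size_ratio < 1:
        raise ValueError("block_size_ratio must be >= 1")
    return [
        block_id * block_size_ratio + offset
        for block_id in large_block_ids
        for offset in range(block_size_ratio)
    ]


# Example: large blocks of 32 tokens vs. small blocks of 16 tokens give ratio 2.
small_ids = expand_block_ids([0, 2], block_size_ratio=32 // 16)
print(small_ids)
```

With a ratio of 1 the mapping is the identity, which corresponds to the homogeneous case.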

RFCs

Known Issues/Bugs:

Bug Fixes

Related Projects

cc @robertgshaw2-redhat @tlrmchlsmth @markmc @njhill @orozery

Alternatives

No response

Additional context

No response

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
