[Feature]: Extract KV-Cache update from all attention backends · vllm-project/vllm#32335

2026-01-14T14:24:56.000Z

### 🚀 The feature, motivation and pitch

Similar to how https://github.com/vllm-project/vllm/pull/25954 extracts it from FlashAttn. Ideally, we want to cover all backends with kv-cache update from `v1/attention/backends`.

Backends:
- [x] FlashAttention
- [x] FlashInfer
- [ ] AiterFlashAttention (in progress)
- [x] RocmAiterUnifiedAttention
- [x] RocmAttention
- [x] TritonAttention
- [x] FlashAttentionDiffKV
- [x] FlexAttention
- [x] TreeAttention

MLA Backends:
- [x] FlashAttnMLA
- [x] FlashInferMLA
- [x] FlashMLASparse
- [x] FlashMLA
- [x] AiterMLA
- [x] ROCMAiterMLASparse
- [x] CutlassMLA
- [x] TritonMLA

After all backends are supported, we can remove `slot_mapping` from attention metadata.

### Alternatives

_No response_

### Additional context

_No response_

### Before submitting a new issue...

- [x] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the [documentation page](https://docs.vllm.ai/en/latest/), which can answer lots of frequently asked questions.

Contributor guide

Tech stack: python, pytorch, cpp
Domain: backend, performance, ai
Issue type: feature
Difficulty: 5
Estimated time: over 1 week
Activity status: active
Clarity: clear
Prerequisites: vllm attention backends, KV cache optimization, CUDA programming, v1/attention module
Beginner friendliness: 10
Research direction: Review PR #25954, which demonstrates the extraction pattern for FlashAttn. Examine the already completed backends (e.g., FlashInfer, TritonAttention) in `v1/attention/backends` to understand the structure. The AiterFlashAttention backend is in progress; coordinate with the current assignees. The goal is to implement the KV-cache update extraction for the remaining backends and eventually remove `slot_mapping` from attention metadata.
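To make the extraction pattern concrete, here is a minimal pure-Python sketch of the idea, not vLLM's actual code. It assumes a toy flat cache with one slot per token and a single attention head; the names `ToyKVCache`, `make_cache`, `update_kv_cache`, and `attend` are hypothetical and do not correspond to real vLLM APIs. The point it illustrates is the separation: the cache write consumes `slot_mapping` in its own step, so the attention computation only reads from the cache and no longer needs `slot_mapping` at all.

```python
import math
from dataclasses import dataclass


@dataclass
class ToyKVCache:
    """Toy flat cache (assumption): one slot per token, one head."""
    keys: list
    values: list


def make_cache(num_slots, head_dim):
    """Allocate a zero-filled cache with num_slots slots."""
    return ToyKVCache(
        keys=[[0.0] * head_dim for _ in range(num_slots)],
        values=[[0.0] * head_dim for _ in range(num_slots)],
    )


def update_kv_cache(cache, new_keys, new_values, slot_mapping):
    """Extracted step: write each new token's key/value into its slot.

    slot_mapping[i] is the cache slot for the i-th new token.
    """
    for key, value, slot in zip(new_keys, new_values, slot_mapping):
        cache.keys[slot] = list(key)
        cache.values[slot] = list(value)


def attend(query, cache, slots):
    """Read-only softmax attention over the given cache slots.

    Note that this function takes no slot_mapping argument.
    """
    scores = [sum(q * k for q, k in zip(query, cache.keys[s])) for s in slots]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(score - top) for score in scores]
    total = sum(weights)
    return [
        sum(w * cache.values[s][d] for w, s in zip(weights, slots)) / total
        for d in range(len(query))
    ]


# Step 1: update the cache as a separate operation.
cache = make_cache(num_slots=4, head_dim=2)
update_kv_cache(
    cache,
    new_keys=[[1.0, 0.0], [0.0, 1.0]],
    new_values=[[10.0, 0.0], [0.0, 20.0]],
    slot_mapping=[2, 0],
)

# Step 2: run attention against the already updated cache.
output = attend([0.0, 0.0], cache, slots=[2, 0])
```

In this toy setup, once every backend writes the cache through a separate step like `update_kv_cache`, nothing on the attention path needs `slot_mapping`, which matches the issue's stated end goal of removing it from attention metadata.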
