vllm-project/vllm

[RFC]: Support ViT Full CUDA Graph (Tracker)

Open

#38175 opened on March 26, 2026

RFC, help wanted, multi-modality


Description

Motivation.

Multimodal large language models (e.g., Qwen3-VL, Qwen3.5, GLM-V, Kimi K2.5) rely on a Vision Transformer (ViT) encoder to process visual inputs before feeding them into the language model backbone. In production serving scenarios, the ViT forward pass involves launching a large number of small CUDA kernels — including patch embedding, layer normalization, multi-head self-attention, and MLP projections — each of which incurs non-trivial kernel launch overhead on the host side.

Currently, vLLM supports CUDA graph capture for the decoder (LLM) portion of the model, which has proven effective at reducing kernel launch costs and improving throughput. However, the ViT encoder is still executed eagerly, meaning every forward pass re-launches all kernels from scratch. Extending full CUDA graph support to the ViT encoder would allow the entire encoder forward pass to be captured and replayed as a single graph, eliminating per-kernel launch overhead and enabling more consistent, low-latency inference for multimodal models.
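One practical consequence of replaying a captured graph is that tensor shapes are fixed at capture time, so a variable-sized visual input generally has to be padded up to one of a small set of captured sizes. The RFC does not say how the ViT path handles this. What follows is only a minimal sketch of the size-selection step: the capture sizes and the helper names `pick_capture_size` and `padding_needed` are hypothetical and are not taken from vLLM.

```python
from bisect import bisect_left

# Hypothetical capture sizes (patch-token counts), sorted ascending.
# These values are illustrative assumptions, not vLLM defaults.
CAPTURE_SIZES = [256, 512, 1024, 2048, 4096]


def pick_capture_size(num_tokens, capture_sizes=CAPTURE_SIZES):
    """Return the smallest captured size that fits num_tokens, or None.

    None means no captured graph is large enough, in which case a caller
    would fall back to eager execution.
    """
    if num_tokens <= 0:
        raise ValueError("num_tokens must be positive")
    # bisect_left finds the first size >= num_tokens in the sorted list.
    idx = bisect_left(capture_sizes, num_tokens)
    return capture_sizes[idx] if idx < len(capture_sizes) else None


def padding_needed(num_tokens, capture_sizes=CAPTURE_SIZES):
    """Number of padding tokens required to reach the chosen graph size."""
    size = pick_capture_size(num_tokens, capture_sizes)
    return None if size is None else size - num_tokens
```

With these assumed sizes, an input of 1000 patch tokens would map to the 1024-token graph and need 24 padding tokens, while an input larger than the biggest captured size would run eagerly. Fewer, coarser sizes reduce capture time and memory at the cost of more padding per request.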

Proposed Change.

Model Integration:

[!NOTE] Integration Workflow:

  1. Implement the ViT CUDA graph interface for the model, using Qwen3-VL as a reference.
  2. Run tests: unit, end-to-end, benchmark, ...
  3. Update the supported model list in the docs.
  4. Add the model to the CI tests.

Bugfix / Improvement:

Testing Coverage:

Documentation:

Feedback Period.

No response

CC List.

@ywang96 @Isotr0py @wangshangsam

Any Other Things.

No response

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
