[RFC]: Support ViT Full CUDA Graph (Tracker) · vllm-project/vllm#38175

Description

Motivation.

Multimodal large language models (e.g., Qwen3-VL, Qwen3.5, GLM-V, Kimi K2.5) rely on a Vision Transformer (ViT) encoder to process visual inputs before feeding them into the language model backbone. In production serving scenarios, the ViT forward pass involves launching a large number of small CUDA kernels — including patch embedding, layer normalization, multi-head self-attention, and MLP projections — each of which incurs non-trivial kernel launch overhead on the host side.

Currently, vLLM supports CUDA graph capture for the decoder (LLM) portion of the model, which has proven effective at reducing kernel launch costs and improving throughput. However, the ViT encoder is still executed eagerly, meaning every forward pass re-launches all kernels from scratch. Extending full CUDA graph support to the ViT encoder would allow the entire encoder forward pass to be captured and replayed as a single graph, eliminating per-kernel launch overhead and enabling more consistent, low-latency inference for multimodal models.
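For readers unfamiliar with the mechanism, the capture-and-replay idea can be sketched with PyTorch's public `torch.cuda.CUDAGraph` API. This is a minimal illustration, not vLLM's implementation: the helper name `make_graphed`, the toy `block` module, and the tensor sizes are all hypothetical, and the sketch falls back to eager execution when no CUDA device is present.

```python
import torch
from torch import nn


def make_graphed(module: nn.Module, example_input: torch.Tensor, warmup_iters: int = 3):
    """Hypothetical helper: capture `module` into a CUDA graph and return a replay callable."""
    if not (torch.cuda.is_available() and example_input.is_cuda):
        return module  # no GPU: keep running eagerly

    # Static buffers: the captured graph always reads and writes these addresses.
    static_input = example_input.clone()

    # Warm up on a side stream so lazy initialization is not recorded in the graph.
    side_stream = torch.cuda.Stream()
    side_stream.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(side_stream), torch.no_grad():
        for _ in range(warmup_iters):
            module(static_input)
    torch.cuda.current_stream().wait_stream(side_stream)

    # Capture one forward pass; every kernel it launches is recorded into `graph`.
    graph = torch.cuda.CUDAGraph()
    with torch.no_grad(), torch.cuda.graph(graph):
        static_output = module(static_input)

    def replay(x: torch.Tensor) -> torch.Tensor:
        # A captured graph is tied to the shape it was captured with.
        assert x.shape == static_input.shape, "input shape must match the captured shape"
        static_input.copy_(x)   # refresh the static input buffer
        graph.replay()          # one host-side launch replays all recorded kernels
        return static_output.clone()

    return replay


device = "cuda" if torch.cuda.is_available() else "cpu"
# Toy stand-in for one encoder sub-block (layer norm followed by an MLP).
block = nn.Sequential(
    nn.LayerNorm(64), nn.Linear(64, 256), nn.GELU(), nn.Linear(256, 64)
).to(device).eval()
x = torch.randn(16, 64, device=device)

run_block = make_graphed(block, x)
with torch.no_grad():
    eager_out = block(x)
graphed_out = run_block(x)
```

Because replay reuses fixed buffers and a fixed shape, a real encoder integration has to decide how variable-sized image inputs map onto captured shapes; that design question is outside the scope of this sketch.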

Proposed Change.

Model Integration:

[!NOTE] Integration Workflow:

1. Implement the ViT CUDA graph interface for the model, using Qwen3-VL as the reference.

2. Run tests: unit, end-to-end, benchmark, ...

3. Update the supported model list in the documentation.

4. Add the model to the CI tests.

Bugfix / Improvement:

Testing Coverage:

https://github.com/vllm-project/vllm/pull/40780

Documentation:

Feedback Period.

No response

CC List.

@ywang96 @Isotr0py @wangshangsam

Any Other Things.