[RFC]: Support ViT Full CUDA Graph (Tracker) · vllm-project/vllm#38175

Description

Motivation.

Multimodal large language models (e.g., Qwen3-VL, Qwen3.5, GLM-V, Kimi K2.5) rely on a Vision Transformer (ViT) encoder to process visual inputs before feeding them into the language model backbone. In production serving, the ViT forward pass launches a large number of small CUDA kernels (patch embedding, layer normalization, multi-head self-attention, MLP projections), each of which incurs non-trivial launch overhead on the host side.

Currently, vLLM supports CUDA graph capture for the decoder (LLM) portion of the model, which has proven effective at reducing kernel launch costs and improving throughput. However, the ViT encoder still runs eagerly, so every forward pass launches each kernel individually. Extending full CUDA graph support to the ViT encoder would allow the entire encoder forward pass to be captured once and replayed as a single graph, removing per-kernel launch overhead and giving more consistent, lower-latency inference for multimodal models.
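For readers less familiar with capture and replay, here is a minimal sketch of the idea using PyTorch's public `torch.cuda.CUDAGraph` API. It is not vLLM's implementation or its ViT CUDA graph interface: `ToyEncoderBlock` and `GraphedEncoder` are hypothetical names invented for illustration, the block is a stand-in for a real ViT layer, and the sketch assumes a single fixed input shape. On a machine without CUDA it falls back to eager execution.

```python
import torch
import torch.nn as nn


class ToyEncoderBlock(nn.Module):
    """Hypothetical stand-in for one ViT block: LayerNorm + MLP with a residual."""

    def __init__(self, dim: int = 64) -> None:
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.mlp(self.norm(x))


class GraphedEncoder:
    """Hypothetical wrapper: capture one forward pass for a fixed shape, then replay it."""

    def __init__(self, model: nn.Module, example: torch.Tensor) -> None:
        self.model = model
        # CUDA graphs replay fixed memory addresses, so inputs live in a static buffer.
        self.static_in = example.clone()
        self.static_out = None
        self.graph = None
        if example.is_cuda:
            with torch.no_grad():
                # Warm up on a side stream so one-time initialization is not captured.
                side = torch.cuda.Stream()
                side.wait_stream(torch.cuda.current_stream())
                with torch.cuda.stream(side):
                    for _ in range(3):
                        self.model(self.static_in)
                torch.cuda.current_stream().wait_stream(side)
                # Record every kernel of the forward pass into a single graph.
                self.graph = torch.cuda.CUDAGraph()
                with torch.cuda.graph(self.graph):
                    self.static_out = self.model(self.static_in)

    def __call__(self, x: torch.Tensor) -> torch.Tensor:
        if self.graph is None:
            # No CUDA available: run eagerly.
            with torch.no_grad():
                return self.model(x)
        self.static_in.copy_(x)   # write the new input into the captured buffer
        self.graph.replay()       # one graph launch replaces the per-kernel launches
        return self.static_out.clone()


device = "cuda" if torch.cuda.is_available() else "cpu"
torch.manual_seed(0)
model = ToyEncoderBlock().to(device).eval()
graphed = GraphedEncoder(model, torch.zeros(16, 64, device=device))

x = torch.randn(16, 64, device=device)
out = graphed(x)
with torch.no_grad():
    expected = model(x)
```

The static input buffer is the main design constraint this sketch illustrates: a captured graph only replays for the shape and memory addresses it was recorded with, so a real encoder with variable image sizes needs some strategy for mapping inputs onto captured shapes, which this sketch does not attempt.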

Proposed Change.

Model Integration:

[!NOTE] Integration Workflow:

1. Implement the ViT CUDA graph interface for the model, using Qwen3-VL as the reference.

2. Run tests: unit, e2e, benchmark, ...

3. Update the supported model list in the docs.

4. Add the model to the CI tests.

Bugfix / Improvement:

Testing Coverage:

https://github.com/vllm-project/vllm/pull/40780

Documentation:

Feedback Period.

No response

CC List.

@ywang96 @Isotr0py @wangshangsam

Any Other Things.