Description
Motivation.
Multimodal large language models (e.g., Qwen3-VL, Qwen3.5, GLM-V, Kimi K2.5) rely on a Vision Transformer (ViT) encoder to process visual inputs before feeding them into the language model backbone. In production serving scenarios, the ViT forward pass launches a large number of small CUDA kernels (for patch embedding, layer normalization, multi-head self-attention, and MLP projections), each of which incurs non-trivial kernel launch overhead on the host side.
Currently, vLLM supports CUDA graph capture for the decoder (LLM) portion of the model, which has proven effective at reducing kernel launch costs and improving throughput. However, the ViT encoder is still executed eagerly, meaning every forward pass re-launches all kernels from scratch. Extending full CUDA graph support to the ViT encoder would allow the entire encoder forward pass to be captured and replayed as a single graph, eliminating per-kernel launch overhead and enabling more consistent, low-latency inference for multimodal models.
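For readers unfamiliar with the mechanism, the capture-and-replay idea can be sketched with PyTorch's public CUDA graph API (`torch.cuda.CUDAGraph`, `torch.cuda.graph`). This is a minimal illustration only, not vLLM's implementation: `TinyEncoderBlock` and `make_graphed_forward` are hypothetical names, the block is a toy stand-in for a ViT layer, and the sketch assumes a fixed input shape. It falls back to eager execution when no CUDA device is present.

```python
import torch
import torch.nn as nn


class TinyEncoderBlock(nn.Module):
    """Hypothetical stand-in for one ViT block: LayerNorm followed by an MLP."""

    def __init__(self, dim: int = 64) -> None:
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.mlp(self.norm(x))


def make_graphed_forward(model: nn.Module, example: torch.Tensor):
    """Return a callable that replays a captured CUDA graph of model(example).

    The returned callable only accepts inputs with the same shape as `example`.
    When `example` is not on a CUDA device, it simply runs the model eagerly.
    """
    if not example.is_cuda:
        def eager(x: torch.Tensor) -> torch.Tensor:
            with torch.no_grad():
                return model(x)
        return eager

    # Static buffers: the graph always reads from and writes to these tensors.
    static_in = example.clone()

    # Warm up on a side stream so one-time initialization is not captured.
    side = torch.cuda.Stream()
    side.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(side), torch.no_grad():
        for _ in range(3):
            model(static_in)
    torch.cuda.current_stream().wait_stream(side)

    # Capture the whole forward pass as a single graph.
    graph = torch.cuda.CUDAGraph()
    with torch.no_grad(), torch.cuda.graph(graph):
        static_out = model(static_in)

    def replay(x: torch.Tensor) -> torch.Tensor:
        static_in.copy_(x)   # copy new data into the captured input buffer
        graph.replay()       # one launch replays every captured kernel
        return static_out.clone()

    return replay


device = "cuda" if torch.cuda.is_available() else "cpu"
torch.manual_seed(0)
model = TinyEncoderBlock().to(device).eval()
example = torch.randn(16, 64, device=device)
graphed = make_graphed_forward(model, example)

new_input = torch.randn(16, 64, device=device)
with torch.no_grad():
    expected = model(new_input)   # eager reference
result = graphed(new_input)       # graph replay (or eager fallback on CPU)
```

The key constraint this illustrates is that a captured graph works on fixed buffers and fixed shapes, so new inputs have to be copied into the captured input tensor before each replay.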
Proposed Change.
Model Integration:
- https://github.com/vllm-project/vllm/pull/35963 @b-mu
- https://github.com/vllm-project/vllm/pull/38061 @shen-shanshan
- https://github.com/vllm-project/vllm/pull/42151 @shen-shanshan
- https://github.com/vllm-project/vllm/pull/41736 @johncalesp
- https://github.com/vllm-project/vllm/pull/40830 @johncalesp
- https://github.com/vllm-project/vllm/pull/42224 @JisoLya
- https://github.com/vllm-project/vllm/pull/40576 @grYe99
- https://github.com/vllm-project/vllm/pull/40660 @allgather
- https://github.com/vllm-project/vllm/pull/41759 @oguzhankir
- https://github.com/vllm-project/vllm/pull/41992 @oguzhankir
- https://github.com/vllm-project/vllm/pull/42785 @YunzhuLu
- Support InternVL @evezhier
- Support DeepSeek-OCR @shen-shanshan
- Support DeepSeek-OCR-2 @shen-shanshan
- Support PaddleOCR-VL @harsha20032020
[!NOTE] Integration Workflow:
- Implement the ViT CUDA graph interface for the model, using Qwen3-VL as a reference.
- Run tests: unit, end-to-end, benchmark, ...
- Update the supported model list in the documentation.
- Add the model to the CI tests.
Bugfix / Improvement:
- https://github.com/vllm-project/vllm/pull/38040 @b-mu
- https://github.com/vllm-project/vllm/pull/38116
- https://github.com/vllm-project/vllm/pull/40445 @shen-shanshan
- https://github.com/vllm-project/vllm/pull/40580 @shen-shanshan
- https://github.com/vllm-project/vllm/pull/41234
- https://github.com/vllm-project/vllm/pull/41714
- https://github.com/vllm-project/vllm/pull/42288
- https://github.com/vllm-project/vllm/pull/43082
- https://github.com/vllm-project/vllm/pull/43321
- https://github.com/vllm-project/vllm/pull/43403
Testing Coverage:
- https://github.com/vllm-project/vllm/pull/40780 @shen-shanshan
Documentation:
- https://github.com/vllm-project/vllm/pull/37914 @b-mu
- https://github.com/vllm-project/vllm/pull/40355 @shen-shanshan
Feedback Period.
No response
CC List.
@ywang96 @Isotr0py @wangshangsam
Any Other Things.
No response