[Feature]: Enable LoRA support for tower and connector in more MM models · vllm-project/vllm#31479

Repository metrics

Stars: 80,034
PR merge metrics: average merge time 3d 17h; 993 PRs merged in 30 days

Description

🚀 The feature, motivation and pitch

For multi-modal models, we already support adding LoRA to the tower encoder and connector (see #26674), but this has been implemented for only a few models (the Qwen VL series and idefics3). There is no reason not to support other multi-modal models.

Solution

For each remaining model where we want to support LoRA on the tower encoder and connector, we need to implement the following two functions:

get_num_mm_encoder_tokens
get_num_mm_connector_tokens

These two functions are needed because the number of multi-modal tokens represented in the language model does not necessarily match the input length required by the linear layers in the vision tower or connector. Since the lora_mapping requires the precise input token length prior to activation, these helper functions bridge the discrepancy by calculating the correct lengths.
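To make the length mismatch concrete, here is a minimal, self-contained sketch. It does not import vLLM; `ToyVisionConfig` and `ToyMultiModalModel` are hypothetical names, and the argument names and semantics of the two methods are assumptions made for illustration rather than a confirmed vLLM interface. It assumes a Qwen2-VL-style layout in which the connector merges `spatial_merge_size x spatial_merge_size` patches into one language-model token, so the encoder sees more tokens than the language model does.

```python
from dataclasses import dataclass


@dataclass
class ToyVisionConfig:
    # Hypothetical config: the connector merges merge x merge patches
    # into a single language-model token.
    spatial_merge_size: int = 2


class ToyMultiModalModel:
    """Toy stand-in for a multi-modal model; not a real vLLM class."""

    def __init__(self, vision_config: ToyVisionConfig) -> None:
        self.vision_config = vision_config

    def get_num_mm_encoder_tokens(self, num_image_tokens: int) -> int:
        # Assumed semantics: map the number of image tokens seen by the
        # language model to the number of patch tokens the vision tower
        # processes (before merging).
        merge = self.vision_config.spatial_merge_size
        return num_image_tokens * merge**2

    def get_num_mm_connector_tokens(self, num_vision_tokens: int) -> int:
        # Assumed semantics: map the vision tower's token count to the
        # number of rows the connector's linear layers receive (after
        # merging).
        merge = self.vision_config.spatial_merge_size
        return num_vision_tokens // merge**2


model = ToyMultiModalModel(ToyVisionConfig(spatial_merge_size=2))
lm_tokens = 64  # image placeholder tokens in the language model
encoder_tokens = model.get_num_mm_encoder_tokens(lm_tokens)
connector_tokens = model.get_num_mm_connector_tokens(encoder_tokens)
print(lm_tokens, encoder_tokens, connector_tokens)
```

In this toy setup the LoRA mapping for the tower would need to cover four times as many rows as the language model's image tokens. A real model may differ in other ways (for example, extra class tokens or padding), so each model's tower and connector have to be inspected individually.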

List of models that are completed or WIP

Qwen VL series: #26674
idefics3: #26674
LLaVA: https://github.com/vllm-project/vllm/pull/31513
BLIP2: https://github.com/vllm-project/vllm/pull/31620
GLM4: https://github.com/vllm-project/vllm/pull/31652
PaliGemma: https://github.com/vllm-project/vllm/pull/31656
H2OVL: https://github.com/vllm-project/vllm/pull/31696
Pixtral: https://github.com/vllm-project/vllm/pull/31724
DotsOCR: https://github.com/vllm-project/vllm/pull/31825
InternVL2: https://github.com/vllm-project/vllm/pull/32397
Gemma3: https://github.com/vllm-project/vllm/pull/32764
Llama 4 Vision: https://github.com/vllm-project/vllm/pull/35147
Gemma4: https://github.com/vllm-project/vllm/pull/39291

Contributor guide

Research direction: Pick a multi-modal model not yet supported (e.g., LLaVA, BLIP2). Study the existing implementations for Qwen VL or idefics3 to understand the two required functions, get_num_mm_encoder_tokens and get_num_mm_connector_tokens. Examine the model's vision tower and connector structure to implement these functions correctly. Validate by running tests.
Tech stack: Python, PyTorch
Domain: machine learning, AI, backend
Issue type: Feature
Difficulty: 3
Estimated time: 1-2 days
Activity status: Active
Clarity: Clear
Prerequisites: Python, LoRA, multi-modal models
Newbie friendliness: 60