[Feature]: Enable LoRA support for tower and connector in more MM models
#31479 opened on Dec 29, 2025
Description
🚀 The feature, motivation and pitch
For multi-modal models, we already support adding LoRA to the tower encoder and connector (see #26674), but this is only implemented for a few models (the Qwen VL series and idefics3). There is no reason not to support the other multi-modal models.
Solution
For each remaining model where we want to support adding LoRA to the tower encoder and connector, we need to implement the following two functions:
- `get_num_mm_encoder_tokens`
- `get_num_mm_connector_tokens`
We need these two functions because the number of multi-modal tokens represented in the language model does not necessarily match the input length seen by the linear layers in the vision tower or connector. Since `lora_mapping` requires the exact input token length prior to activation, these helper functions bridge the discrepancy by computing the correct lengths.
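To make the length mismatch concrete, here is a minimal toy sketch. It is not the vLLM implementation: the class names, the method signatures, and the `spatial_merge_size` parameter are assumptions for illustration, modeled on a connector that merges a square block of vision patches into one language-model token. Real models may need different arithmetic (for example, extra class tokens or a different downsampling scheme).

```python
from dataclasses import dataclass


@dataclass
class ToyVisionConfig:
    # Hypothetical parameter: the connector merges a
    # (spatial_merge_size x spatial_merge_size) block of patches
    # into a single token for the language model.
    spatial_merge_size: int = 2


class ToyMultiModalModel:
    """Illustrative stand-in for a multi-modal model class (not a vLLM class)."""

    def __init__(self, config: ToyVisionConfig) -> None:
        self.config = config

    def get_num_mm_encoder_tokens(self, num_image_tokens: int) -> int:
        # Assumed signature. The tower runs before merging, so it sees
        # merge_size**2 patch tokens for every token the language model sees.
        merge = self.config.spatial_merge_size
        return num_image_tokens * merge * merge

    def get_num_mm_connector_tokens(self, num_vision_tokens: int) -> int:
        # Assumed signature. In this toy model the connector's linear layers
        # run on the merged sequence, so their input is shorter than the
        # tower's output by a factor of merge_size**2.
        merge = self.config.spatial_merge_size
        return num_vision_tokens // (merge * merge)


model = ToyMultiModalModel(ToyVisionConfig(spatial_merge_size=2))

num_lm_tokens = 64  # image placeholder tokens in the language model prompt
encoder_len = model.get_num_mm_encoder_tokens(num_lm_tokens)
connector_len = model.get_num_mm_connector_tokens(encoder_len)

print(encoder_len, connector_len)  # 256 64
```

In this sketch the language model sees 64 image tokens while the tower processes 256, so a LoRA mapping sized from the language-model token count alone would be too short for the tower's linear layers.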
List of models that are completed or WIP
- Qwen VL series: #26674
- idefics3: #26674
- LLaVA: https://github.com/vllm-project/vllm/pull/31513
- BLIP2: https://github.com/vllm-project/vllm/pull/31620
- GLM4: https://github.com/vllm-project/vllm/pull/31652
- PaliGemma: https://github.com/vllm-project/vllm/pull/31656
- H2OVL: https://github.com/vllm-project/vllm/pull/31696
- Pixtral: https://github.com/vllm-project/vllm/pull/31724
- DotsOCR: https://github.com/vllm-project/vllm/pull/31825
- InternVL2: https://github.com/vllm-project/vllm/pull/32397
- Gemma3: https://github.com/vllm-project/vllm/pull/32764
- Llama 4 Vision: https://github.com/vllm-project/vllm/pull/35147
- Gemma4: https://github.com/vllm-project/vllm/pull/39291