vllm-project/vllm

[Feature]: Enable LoRA support for tower and connector in more MM models

Open

#31479 opened on Dec 29, 2025

feature request, help wanted, multi-modality

Description

🚀 The feature, motivation and pitch

For multi-modal models, we already support adding LoRA to the tower encoder and connector (see #26674), but this has only been implemented for a few models (the Qwen VL series and idefics3). There is no reason not to support other multi-modal models.

Solution

For each remaining model where we want to support adding LoRA to the tower encoder and connector, we need to implement the following two functions:

  • get_num_mm_encoder_tokens
  • get_num_mm_connector_tokens

These two functions are needed because the number of multi-modal tokens represented in the language model does not necessarily match the input length required by the linear layers in the vision tower or connector. Since the lora_mapping requires the precise input token length prior to activation, these helper functions bridge the discrepancy and calculate the correct lengths.
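As a rough illustration of the length mismatch described above, here is a minimal, self-contained sketch. It is not taken from the vLLM source: the class names, the method signatures, and the assumption that the connector merges `spatial_merge_size x spatial_merge_size` patches into one language-model token are all hypothetical, chosen only to show how the two helpers could convert between lengths. A real model would derive these numbers from its own vision config and connector design.

```python
from dataclasses import dataclass


@dataclass
class ToyVisionConfig:
    # Hypothetical config: the connector is assumed to merge
    # spatial_merge_size x spatial_merge_size patches into one LM token.
    spatial_merge_size: int = 2


class ToyMultiModalModel:
    """Toy stand-in for a multi-modal model; not a real vLLM class."""

    def __init__(self, vision_config: ToyVisionConfig) -> None:
        self.vision_config = vision_config

    def get_num_mm_encoder_tokens(self, num_image_tokens: int) -> int:
        # Assumed signature. Maps the number of image placeholder tokens
        # seen by the language model to the number of patch tokens that
        # pass through the vision tower's linear layers.
        merge = self.vision_config.spatial_merge_size
        return num_image_tokens * merge**2

    def get_num_mm_connector_tokens(self, num_vision_tokens: int) -> int:
        # Assumed signature. Maps the vision tower's token count to the
        # number of rows fed to the connector's linear layers, which in
        # this toy model operate after the patch merge.
        merge = self.vision_config.spatial_merge_size
        return num_vision_tokens // merge**2


model = ToyMultiModalModel(ToyVisionConfig(spatial_merge_size=2))
lm_tokens = 64  # image placeholder tokens in the LM prompt
encoder_tokens = model.get_num_mm_encoder_tokens(lm_tokens)
connector_tokens = model.get_num_mm_connector_tokens(encoder_tokens)
print(lm_tokens, encoder_tokens, connector_tokens)
```

In this toy setup the vision tower processes four times as many tokens as the language model sees, so a LoRA mapping sized from the language-model token count alone would be too short for the tower's layers.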

List of models that are completed or WIP
