[Feature]: Composite model loading using `AutoWeightsLoader` for all models · vllm-project/vllm#15697


Labels: feature request, good first issue, keep-open

Description

🚀 The feature, motivation and pitch

#9160 first introduced `AutoWeightsLoader` to recursively call `load_weights` on sub-modules. This lets composite models (most notably multi-modal models) use language backbones (`*Model` classes such as `LlamaModel`) without having to repeat their weight loading logic.

Currently, `load_weights` is only implemented in a few language backbones. It would be great to standardize this approach and apply it to all language backbones in vLLM. The steps to do this are pretty straightforward:

1. Move the existing `load_weights` function from `*ForCausalLM` to `*Model`.
2. Create a new `load_weights` function in `*ForCausalLM` that loads the weights using `AutoWeightsLoader`.
3. Move any logic in `*Model.load_weights` that only applies to `*ForCausalLM` back to `*ForCausalLM.load_weights`. Usually, this involves `lm_head`.

For reference, you can look at the implementation for models such as Llama, Gemma2/3, Qwen2 and ChatGLM.
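
To make the three steps concrete, here is a rough, dependency-free sketch of the delegation pattern. It is not vLLM's code: `ToyAutoLoader`, `ToyParams`, `ToyModel`, `ToyForCausalLM` and `ToyMultiModal` are hypothetical stand-ins, weights are plain floats instead of tensors, and the `skip_prefixes` argument and the `rotary_emb.inv_freq` filter are assumptions chosen only to illustrate where backbone-specific and `lm_head`-specific logic would end up.

```python
from typing import Dict, Iterable, List, Set, Tuple

Weight = Tuple[str, float]  # (dotted name, value); a real loader would carry tensors


class ToyAutoLoader:
    """Hypothetical stand-in for AutoWeightsLoader: splits each weight name on
    its first dot and delegates to that child's own load_weights."""

    def __init__(self, module: object, skip_prefixes: Iterable[str] = ()) -> None:
        self.module = module
        self.skip_prefixes = tuple(skip_prefixes)

    def load_weights(self, weights: Iterable[Weight]) -> Set[str]:
        per_child: Dict[str, List[Weight]] = {}
        for name, value in weights:
            if self.skip_prefixes and name.startswith(self.skip_prefixes):
                continue
            child_name, _, rest = name.partition(".")
            per_child.setdefault(child_name, []).append((rest, value))
        loaded: Set[str] = set()
        for child_name, child_weights in per_child.items():
            child = getattr(self.module, child_name)
            for sub_name in child.load_weights(child_weights):
                loaded.add(f"{child_name}.{sub_name}")
        return loaded


class ToyParams:
    """Leaf module that simply stores whatever it is given."""

    def __init__(self) -> None:
        self.params: Dict[str, float] = {}

    def load_weights(self, weights: Iterable[Weight]) -> Set[str]:
        loaded: Set[str] = set()
        for name, value in weights:
            self.params[name] = value
            loaded.add(name)
        return loaded


class ToyModel(ToyParams):
    """Backbone (*Model). Step 1: backbone-specific loading logic lives here."""

    def load_weights(self, weights: Iterable[Weight]) -> Set[str]:
        # Assumed example of backbone-specific logic: drop a recomputed buffer.
        kept = [(n, v) for n, v in weights if "rotary_emb.inv_freq" not in n]
        return super().load_weights(kept)


class ToyForCausalLM:
    """Steps 2 and 3: delegate to the loader; only lm_head handling stays here."""

    def __init__(self, tie_word_embeddings: bool = False) -> None:
        self.model = ToyModel()
        self.lm_head = ToyParams()
        self.tie_word_embeddings = tie_word_embeddings

    def load_weights(self, weights: Iterable[Weight]) -> Set[str]:
        # With tied embeddings, lm_head weights in the checkpoint are ignored.
        skip = ["lm_head."] if self.tie_word_embeddings else []
        return ToyAutoLoader(self, skip_prefixes=skip).load_weights(weights)


class ToyMultiModal:
    """Composite model reusing the backbone without repeating its logic."""

    def __init__(self) -> None:
        self.vision_tower = ToyParams()
        self.language_model = ToyModel()

    def load_weights(self, weights: Iterable[Weight]) -> Set[str]:
        return ToyAutoLoader(self).load_weights(weights)
```

Because `ToyMultiModal` only routes names to its children, the `rotary_emb.inv_freq` filter in `ToyModel.load_weights` applies automatically to `language_model.*` weights, which is the reuse this issue is after.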

To avoid scope creep, I suggest opening a PR for updating only a few models at a time.

Alternatives

No response

Additional context