vllm-project/vllm

[Bug]: ModelOpt Llama-4 Checkpoints Take 5+ minutes to load

Open

#31,624 opened on January 2, 2026

bug, feature request, good first issue, help wanted

Description

🚀 The feature, motivation and pitch

While working on some MoE refactors, I discovered that Llama-4 (L4) ModelOpt checkpoints take 5+ minutes to load weights, even from the CPU page cache.

The root cause is basically this hack logic that ModelOpt uses to load the state dict.

What happens is that the CPU tensor (loaded_weight) that we are going to load into the GPU tensor (param) becomes non-contiguous due to this logic. As a result, when we eventually call copy_() from CPU->GPU, we are calling it on a non-contiguous CPU tensor, which takes 3-4s per weight.
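To make the failure mode concrete, here is a minimal sketch of how a view-producing op leaves a CPU tensor non-contiguous before the copy. The shapes and the transpose are assumptions for illustration only; the actual ModelOpt loading logic is not reproduced here, and the destination lives on the CPU so the snippet runs anywhere.

```python
import torch

# Hypothetical stand-in for a checkpoint tensor loaded on the CPU.
loaded_weight = torch.randn(8, 16, 32)

# Assumed example of a view-producing op; the real loader logic may differ.
view = loaded_weight.transpose(-1, -2)

print(view.is_contiguous())  # False
print(view.stride())         # (512, 1, 32)

# Destination parameter (kept on the CPU so the sketch runs without a GPU).
param = torch.empty(8, 32, 16)

# copy_() accepts a non-contiguous source, but it has to walk the strides.
param.copy_(view)
```

Checking `is_contiguous()` on the source right before the copy is a cheap way to confirm whether a given weight hits this path.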

To hack around this for local R&D, I simply move loaded_weight to the GPU immediately. The gather then happens on the GPU, which speeds things up a lot. This isn't reasonable as an actual solution, though.
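The workaround can be sketched roughly as follows, assuming the same hypothetical transpose as the view-producing step. The contiguous CPU tensor is transferred first, and the strided view is applied on the device afterwards. The device falls back to the CPU when CUDA is unavailable, in which case the snippet only demonstrates the ordering, not the speedup.

```python
import torch

# Use the GPU when present; otherwise fall back so the sketch still runs.
device = "cuda" if torch.cuda.is_available() else "cpu"

# Hypothetical shapes, chosen only for illustration.
param = torch.empty(8, 32, 16, device=device)
loaded_weight = torch.randn(8, 16, 32)  # contiguous CPU tensor

# Transfer the contiguous tensor first...
on_device = loaded_weight.to(device)

# ...then apply the view and copy, so the strided gather runs on the device.
param.copy_(on_device.transpose(-1, -2))
```

The cost is that the full loaded tensor briefly lives on the device alongside the parameter, which is presumably part of why this is not suitable as a general fix.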

We should investigate where the weight loader logic can avoid creating non-contiguous CPU tensors.
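As a starting point for that investigation, a small diagnostic along these lines could flag which weights arrive non-contiguous. The helper name `find_non_contiguous` and the sample weight names are hypothetical and are not part of vLLM or ModelOpt.

```python
from typing import Iterable

import torch


def find_non_contiguous(weights: Iterable[tuple[str, torch.Tensor]]) -> list[str]:
    """Return the names of tensors that are not contiguous in memory."""
    return [name for name, tensor in weights if not tensor.is_contiguous()]


# Hypothetical weights: one plain tensor and two strided views.
weights = [
    ("dense", torch.randn(4, 8)),
    ("transposed", torch.randn(4, 8).t()),
    ("sliced", torch.randn(4, 8)[:, ::2]),
]

print(find_non_contiguous(weights))  # ['transposed', 'sliced']
```

Running a check like this at each stage of the loader would narrow down which step first produces a strided view.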

Alternatives

No response

Additional context

No response

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
