vllm-project/vllm

[Bug]: ModelOpt Llama-4 Checkpoints Take 5+ minutes to load

Open

#31624 opened on Jan 2, 2026

12 comments, 0 reactions, 0 assignees. Python, 80,034 stars, 16,816 forks.
Labels: bug, feature request, good first issue, help wanted

Description

🚀 The feature, motivation and pitch

While working on some MoE refactors, I discovered that Llama-4 (L4) ModelOpt checkpoints take 5+ minutes to load weights, even when the checkpoint is already in the CPU page cache.

The root cause is the hacky logic used to load the state dict that ModelOpt uses.

Because of this logic, the CPU tensor (the loaded weight) that is about to be copied into the GPU tensor (the param) becomes non-contiguous. As a result, the eventual CPU->GPU copy_() runs on a non-contiguous CPU tensor, which takes 3-4s per weight.
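A minimal sketch of how a CPU tensor ends up non-contiguous before the copy. The transpose here is only an assumed stand-in for whatever reshaping the loader actually does, and the tensor shapes are arbitrary toy values, not the real Llama-4 shapes:

```python
import torch

# Hypothetical stand-in for a weight read from the checkpoint (toy shape).
loaded_weight = torch.arange(24, dtype=torch.float32).reshape(4, 6)

# transpose() returns a view with swapped strides; no data is moved,
# so the result is no longer laid out contiguously in memory.
transposed = loaded_weight.transpose(0, 1)

# Stand-in for the destination parameter (on CPU here so the sketch runs anywhere).
param = torch.empty(6, 4)

# copy_() has to gather the strided source element by element.
param.copy_(transposed)

print(loaded_weight.is_contiguous(), transposed.is_contiguous())
print(transposed.stride())
```

With real weights the strided source is what makes the host-to-device copy slow; the toy tensor above only shows how the layout changes.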

To hack around this for local R&D, I simply move the loaded_weight to the GPU immediately. The gather then happens on the GPU, which speeds things up a lot. This isn't reasonable as an actual solution, though.
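Roughly, the local hack looks like the sketch below. The helper name load_with_early_transfer and the reshape_fn argument are hypothetical and are not part of the vLLM weight loader; the sketch falls back to CPU when no GPU is present so it stays runnable:

```python
import torch

def load_with_early_transfer(param, raw_weight, reshape_fn):
    """Copy raw_weight into param, doing the reshaping on param's device."""
    # Move the still-contiguous checkpoint tensor to the target device first.
    on_device = raw_weight.to(param.device)
    # Any view that makes the tensor non-contiguous is now created on the device.
    view = reshape_fn(on_device)
    # The strided gather in copy_() therefore runs on the device, not on the CPU.
    param.copy_(view)
    return param

device = "cuda" if torch.cuda.is_available() else "cpu"
raw_weight = torch.arange(24, dtype=torch.float32).reshape(4, 6)
param = torch.empty(6, 4, device=device)
load_with_early_transfer(param, raw_weight, lambda t: t.transpose(0, 1))
```

The obvious downside is that the full, unprocessed tensor lands on the GPU before any slicing, which is part of why this is only a local workaround.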

We should investigate how the weight loader logic can avoid creating non-contiguous CPU tensors.
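As a starting point for that investigation, a small diagnostic along these lines could flag which weights reach the copy as non-contiguous CPU tensors. The helper find_noncontiguous and the weight names are hypothetical and do not exist in vLLM:

```python
import torch

def find_noncontiguous(named_tensors):
    """Return the names of CPU tensors that are not contiguous."""
    return [
        name
        for name, tensor in named_tensors
        if tensor.device.type == "cpu" and not tensor.is_contiguous()
    ]

# Toy examples: one dense tensor and two views that break contiguity.
weights = [
    ("dense", torch.zeros(4, 6)),
    ("transposed", torch.zeros(4, 6).t()),
    ("sliced", torch.zeros(4, 6)[:, :3]),
]
flagged = find_noncontiguous(weights)
print(flagged)
```

Calling something like this right before the CPU->GPU copy would narrow down which step in the loader produces the strided views.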

Alternatives

No response

Additional context

No response
