[Bug]: ModelOpt Llama-4 Checkpoints Take 5+ minutes to load
#31624 opened on Jan 2, 2026
Description
🚀 The feature, motivation and pitch
While working on some MoE refactors, I discovered that ModelOpt Llama-4 (L4) checkpoints take 5+ minutes to load weights, even when the files are already in the CPU page cache.
The root cause is the workaround logic that the ModelOpt path uses to load the state dict:
- https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/models/llama4.py#L439-L523 [modelopt is the fused case]
What happens is that the CPU tensor (`loaded_weight`) that we are going to copy into the GPU tensor (`param`) becomes non-contiguous due to this logic. As a result, when we eventually call `copy_()` from CPU to GPU, we are calling it on a non-contiguous CPU tensor, which takes 3-4s per weight.
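For illustration, here is a minimal sketch of how a CPU tensor can become non-contiguous before the copy. The shapes are made up, and the transpose is only an assumed stand-in for whatever view operations the loader applies. It is not the actual code in `llama4.py`.

```python
import torch

# Hypothetical shapes: a small stand-in for a fused expert weight
# read from a checkpoint (always contiguous right after loading).
loaded_weight = torch.randn(4, 8, 16)
was_contiguous = loaded_weight.is_contiguous()

# A transpose returns a view with permuted strides. No data moves,
# so the result is no longer laid out contiguously in memory.
view = loaded_weight.transpose(-1, -2)
view_is_contiguous = view.is_contiguous()

# Destination parameter (on the GPU in the real loader; CPU here so
# the sketch runs anywhere).
param = torch.empty(4, 16, 8)

# copy_() has to walk the strided source element by element. With a
# CPU source and a GPU destination this is the slow path described above.
param.copy_(view)

print(was_contiguous, view_is_contiguous)
```

Checking `is_contiguous()` on `loaded_weight` right before the final `copy_()` is a cheap way to confirm whether a given weight hits this path.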
To hack around this for local R&D, I simply move `loaded_weight` to the GPU immediately. This makes the gather happen on the GPU, which speeds things up a lot. This isn't reasonable as an actual solution, though.
We should investigate how the logic in the weight loader can avoid creating non-contiguous CPU tensors.
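A rough sketch of that local workaround, assuming a hypothetical helper `load_weight_on_device` and a transpose as a stand-in for the loader's real view logic:

```python
import torch


def load_weight_on_device(param: torch.Tensor, loaded_weight: torch.Tensor) -> None:
    """Hypothetical helper, not part of vLLM.

    Moves the still-contiguous checkpoint tensor to the parameter's
    device first, so the strided gather runs there instead of on the CPU.
    """
    # Contiguous host-to-device transfer.
    loaded_weight = loaded_weight.to(param.device)
    # Assumed stand-in for the loader's reshaping; yields a strided view.
    view = loaded_weight.transpose(-1, -2)
    with torch.no_grad():
        # Same-device strided copy.
        param.copy_(view)


# Falls back to CPU so the sketch runs without a GPU.
device = "cuda" if torch.cuda.is_available() else "cpu"
checkpoint_weight = torch.arange(36, dtype=torch.float32).reshape(2, 3, 6)
param = torch.empty(2, 6, 3, device=device)
load_weight_on_device(param, checkpoint_weight)
```

The cost is that the full checkpoint tensor briefly lives on the GPU alongside the parameter, which is one reason this is not a general fix.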
Alternatives
No response
Additional context
No response