[Bug]: ModelOpt Llama-4 Checkpoints Take 5+ minutes to load · vllm-project/vllm#31624

Description

🚀 The feature, motivation and pitch

While working on some MoE refactors, I discovered that the ModelOpt Llama-4 (L4) checkpoint takes 5+ minutes to load weights, even when reading from the CPU page cache.

https://huggingface.co/nvidia/Llama-4-Scout-17B-16E-Instruct-FP8

The root cause is essentially the hacky state-dict loading logic that the ModelOpt path uses:

https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/models/llama4.py#L439-L523 [modelopt is the fused case]

What happens is that the CPU tensor (`loaded_weight`) that we are going to load into the GPU tensor (`param`) becomes non-contiguous because of this logic. As a result, when we eventually call `copy_()` to move it from CPU to GPU, we are calling it on a non-contiguous CPU tensor, which takes 3-4s per weight.
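
To illustrate the contiguity problem described above, here is a minimal sketch in plain PyTorch. It is not the code from `llama4.py`: the tensor shape is made up, and a transpose is used only as one common way a view becomes non-contiguous (the actual operation in the weight loader may differ). It shows how to detect the condition with `is_contiguous()` and how to materialise a dense copy before calling `copy_()`. Whether doing so is actually faster in the loader would need to be measured.

```python
import torch

# Illustrative stand-in for a fused expert weight read from a checkpoint.
# The shape is an assumption, not the real Llama-4 shape.
loaded_weight = torch.arange(2 * 4 * 6, dtype=torch.float32).reshape(2, 4, 6)

# A transpose returns a strided view: no data moves, so it is non-contiguous.
view = loaded_weight.transpose(-1, -2)
is_view_contiguous = view.is_contiguous()

# Destination parameter; falls back to CPU so the sketch runs without a GPU.
device = "cuda" if torch.cuda.is_available() else "cpu"
param = torch.empty(view.shape, dtype=view.dtype, device=device)

# Materialise a dense copy first, then hand copy_() a contiguous source.
dense = view.contiguous()
param.copy_(dense)

# The values in param match the logical contents of the strided view.
values_match = torch.equal(param.cpu(), view)
```

A check like `loaded_weight.is_contiguous()` placed just before the copy in the weight loader is one cheap way to confirm which weights hit the slow path.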

To hack around this for local R&D, I simply move `loaded_weight` to the GPU immediately. This makes the gather happen on the GPU, which speeds things up a lot. It isn't reasonable as an actual solution, though.
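
The local workaround above could look roughly like the following sketch. The helper `copy_via_device` and its `rearrange` argument are hypothetical names introduced here for illustration; they are not part of vLLM. The idea is only that the still-contiguous checkpoint tensor is moved to the parameter's device first, so any reshuffling runs there instead of on the CPU.

```python
import torch

def copy_via_device(param, loaded_weight, rearrange):
    """Hypothetical helper (not a vLLM API): move first, rearrange second."""
    # The checkpoint tensor is still contiguous here, so this is a dense transfer.
    on_device = loaded_weight.to(param.device)
    # Any transpose / slice / gather now runs on the parameter's device.
    param.copy_(rearrange(on_device))
    return param

# Usage with made-up shapes; falls back to CPU so the sketch runs without a GPU.
device = "cuda" if torch.cuda.is_available() else "cpu"
loaded_weight = torch.arange(24, dtype=torch.float32).reshape(4, 6)
param = torch.empty(6, 4, dtype=torch.float32, device=device)
copy_via_device(param, loaded_weight, lambda t: t.transpose(0, 1))
```

One likely drawback is that the full unsharded weight briefly occupies device memory before it is rearranged.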

We should investigate where the weight loader logic can avoid creating non-contiguous CPU tensors in the first place.

Alternatives

No response

Additional context