[Feature] Parallelize module loading to speed up server launch and refit for Diffusion models
#19,092 opened on Feb 20, 2026
Description
Checklist
- If this is not a feature request but a general question, please start a discussion at https://github.com/sgl-project/sglang/discussions. Otherwise, it will be closed.
- Please use English. Otherwise, it will be closed.
Motivation
Based on the detailed breakdown in issue #19087, the current launch process for SGLang Diffusion (using Qwen-Image as an example) is heavily bottlenecked by module loading, which currently happens sequentially.
As shown in the profiling results, load_modules takes approximately 63.20s out of the total 77.09s launch time (~82%).
- text_encoder: 32.96s (52% of loading time)
- transformer: 29.76s (47% of loading time)
- vae/tokenizer/scheduler: < 1s combined
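The arithmetic behind these numbers, and the best case a parallel load could reach, can be checked with a few lines. This is only a back-of-the-envelope estimate: it assumes the two large loads overlap perfectly and that the small modules still load sequentially, neither of which has been measured.

```python
# Timings reported in the profiling breakdown (seconds).
total_launch = 77.09
load_modules = 63.20
text_encoder = 32.96
transformer = 29.76

# Remainder of load_modules (vae, tokenizer, scheduler).
other_modules = round(load_modules - text_encoder - transformer, 2)

# Shares quoted in the breakdown.
share_of_launch = load_modules / total_launch
share_text_encoder = text_encoder / load_modules
share_transformer = transformer / load_modules

# Assumption: text_encoder and transformer overlap perfectly,
# while the small modules are still loaded one after another.
best_case_load = round(max(text_encoder, transformer) + other_modules, 2)
best_case_launch = round(total_launch - load_modules + best_case_load, 2)

print(f"load_modules share of launch: {share_of_launch:.0%}")
print(f"text_encoder share of loading: {share_text_encoder:.0%}")
print(f"transformer share of loading: {share_transformer:.0%}")
print(f"best-case load_modules: {best_case_load}s")
print(f"best-case total launch: {best_case_launch}s")
```

Under those assumptions, load_modules would drop to roughly 33s and the whole launch to roughly 47s.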
Since these modules (Text Encoder, Transformer, VAE) are independent during instantiation and weight loading, running them sequentially is suboptimal. We should implement a parallel loading strategy for load_modules. By using multi-threading (e.g., concurrent.futures.ThreadPoolExecutor, or the parallel weight-loading strategies SGLang already uses for LLMs) to run the initialization and weight transfer of the text encoder and transformer concurrently, we could in theory reduce the module loading time from ~63s to ~33s (the max of the two major components).
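A minimal sketch of the ThreadPoolExecutor approach is below. It is not SGLang code: `load_module`, `MODULE_COSTS`, and `load_modules_parallel` are hypothetical names, and the real loaders are replaced by `time.sleep` calls whose durations mimic the profiled timings scaled down by 100x.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for a real per-module loader.
def load_module(name, seconds):
    time.sleep(seconds)  # simulates instantiation + weight loading
    return f"{name} loaded"

# Simulated costs in seconds (profiled timings scaled down by 100x).
MODULE_COSTS = {
    "text_encoder": 0.33,
    "transformer": 0.30,
    "vae": 0.01,
}

def load_modules_parallel(costs):
    """Submit every loader to a thread pool and wait for all of them."""
    with ThreadPoolExecutor(max_workers=len(costs)) as pool:
        futures = {
            name: pool.submit(load_module, name, seconds)
            for name, seconds in costs.items()
        }
        # result() re-raises any exception raised inside the worker thread.
        return {name: future.result() for name, future in futures.items()}

start = time.perf_counter()
modules = load_modules_parallel(MODULE_COSTS)
elapsed = time.perf_counter() - start
print(f"loaded {sorted(modules)} in {elapsed:.2f}s")
```

Here the wall time approaches the slowest loader rather than the sum because `time.sleep` releases the GIL. Real loaders would only overlap to the extent that their disk reads and memory copies also release the GIL and are not limited by shared disk or PCIe bandwidth, so the ~33s figure should be treated as an upper bound on the benefit until measured.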
Benefits:
- Faster Cold Start: Significantly reduces the time for launch_server.
- Optimized Refit/Wake-up: As discussed in #19090, if we support offloading and waking up, parallelizing the weight reload from CPU/Disk to GPU will make the "wake-up" latency much lower.
Implementation Consideration:
- Ensure thread safety for CUDA context operations during model.to(device).
- Monitor peak CPU memory usage when loading multiple state dicts into RAM simultaneously before moving them to GPU.
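One conservative way to address both considerations is sketched below: reads from disk run in parallel, a lock serializes the device-transfer step, and a semaphore caps how many state dicts sit in RAM at once. Everything here is hypothetical (`read_state_dict`, `move_to_device`, the limit of 2); the sleeps stand in for real I/O, and whether the transfer actually needs to be serialized is an open question rather than an established requirement.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

transfer_lock = threading.Lock()            # serializes the host-to-device step
ram_slots = threading.BoundedSemaphore(2)   # at most 2 state dicts in RAM at once

# Bookkeeping to observe how many state dicts are resident at the same time.
counter_lock = threading.Lock()
in_ram = 0
peak_in_ram = 0

def read_state_dict(name):
    time.sleep(0.05)  # placeholder for reading weights from disk
    return {"name": name}

def move_to_device(state_dict):
    time.sleep(0.01)  # placeholder for the model.to(device) step
    return f"{state_dict['name']} on device"

def load_one(name):
    global in_ram, peak_in_ram
    with ram_slots:  # blocks while too many state dicts are resident
        with counter_lock:
            in_ram += 1
            peak_in_ram = max(peak_in_ram, in_ram)
        try:
            state_dict = read_state_dict(name)
            with transfer_lock:  # only one thread transfers at a time
                return move_to_device(state_dict)
        finally:
            with counter_lock:
                in_ram -= 1

names = ["text_encoder", "transformer", "vae"]
with ThreadPoolExecutor(max_workers=len(names)) as pool:
    loaded = list(pool.map(load_one, names))

print(loaded, "peak state dicts in RAM:", peak_in_ram)
```

The semaphore trades some parallelism for a bounded peak memory footprint; with only two large modules, a limit of 2 keeps the main overlap while preventing a third state dict from being read in alongside them.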
Apart from this, it would also be helpful to check whether we are loading the weights through pageable memory:
https://github.com/sgl-project/sglang/issues/19087#issuecomment-3937649804
Spending 32s on loading text_encoder looks strange: if we were using pinned memory, the load speed could be much higher.
Related resources
No response