[Feature] Parallelize module loading to speed up server launch and refit for Diffusion models · sgl-project/sglang#19092

Description

Motivation

According to the detailed breakdown in issue #19087, the launch process for SGLang Diffusion (using Qwen-Image as an example) is heavily bottlenecked by module loading, which currently runs sequentially.

As shown in the profiling results, load_modules takes approximately 63.20s out of the total 77.09s launch time (~82%).

text_encoder: 32.96s (52% of loading time)
transformer: 29.76s (47% of loading time)
vae/tokenizer/scheduler: < 1s combined.

Since these modules (Text Encoder, Transformer, VAE) are independent during instantiation and weight loading, running them sequentially is suboptimal. We should implement a parallel loading strategy for load_modules. By using multi-threading (e.g., concurrent.futures.ThreadPoolExecutor, or one of the parallel weight-loading strategies SGLang already uses for LLMs) to initialize the text encoder and transformer and transfer their weights simultaneously, we could theoretically reduce the module loading time from ~63s to ~33s (the maximum of the two major components).

Benefits:

Faster Cold Start: Significantly reduces the time for launch_server.
Optimized Refit/Wake-up: As discussed in #19090, if we support offloading and waking up, parallelizing the weight reload from CPU/disk to GPU would substantially lower the "wake-up" latency.

Implementation Considerations:

Ensure thread safety for CUDA context operations during model.to(device).
Monitor peak CPU memory usage when loading multiple state dicts into RAM simultaneously before moving them to the GPU.
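For the memory concern, one possible way to observe the host-side high-water mark is the standard library's resource module. This is a sketch only: peak_rss_mib and run_and_report_peak are hypothetical helper names, the resource module is Unix-only, and ru_maxrss covers the whole process (all threads), so it cannot attribute memory to a single loader.

```python
import resource
import sys
from typing import Any, Callable, Tuple


def peak_rss_mib() -> float:
    """Peak resident set size of this process so far, in MiB."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in bytes on macOS and in kilobytes on Linux.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


def run_and_report_peak(fn: Callable[[], Any]) -> Tuple[Any, float, float]:
    """Call fn and return (result, peak RSS before, peak RSS after) in MiB."""
    before = peak_rss_mib()
    result = fn()
    after = peak_rss_mib()
    return result, before, after


if __name__ == "__main__":
    # Placeholder workload; a real check would wrap the module-loading call.
    _, before, after = run_and_report_peak(lambda: bytearray(64 * 1024 * 1024))
    print(f"peak RSS before: {before:.1f} MiB, after: {after:.1f} MiB")
```

Comparing the "after" value between a sequential and a parallel run would show how much extra RAM headroom concurrent loading needs.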

Apart from this, it would also be helpful to check whether we are loading the weights from pageable memory.

https://github.com/sgl-project/sglang/issues/19087#issuecomment-3937649804

Spending 32s on loading text_encoder looks strange: if we were using pinned memory, the load speed could be much higher.
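A quick way to check this could be to tally how many bytes of a CPU-side state dict sit in pinned versus pageable memory before the transfer. The helper below is a hypothetical sketch, not existing SGLang code; it relies only on the is_pinned(), numel() and element_size() methods that torch.Tensor provides, and it assumes every tensor passed in lives on the CPU (a GPU tensor would report is_pinned() as False and be counted as pageable).

```python
from typing import Any, Dict, Mapping


def summarize_pinning(tensors: Mapping[str, Any]) -> Dict[str, int]:
    """Count bytes held in pinned vs. pageable host memory.

    `tensors` maps parameter names to objects exposing numel(),
    element_size() and is_pinned(), as torch.Tensor does.
    """
    totals = {"pinned_bytes": 0, "pageable_bytes": 0}
    for tensor in tensors.values():
        nbytes = tensor.numel() * tensor.element_size()
        key = "pinned_bytes" if tensor.is_pinned() else "pageable_bytes"
        totals[key] += nbytes
    return totals


# Assumed usage with PyTorch, on a state dict that is still on the CPU:
#   report = summarize_pinning(cpu_state_dict)
#   print(report)
```

If most bytes turn out to be pageable, pinning the host buffers (for example with tensor.pin_memory()) and copying with non_blocking=True is the usual thing to try, though whether it explains the 32s here would have to be measured.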

Contributor guide

Research direction: Investigate parallelizing the load_modules function using concurrent.futures.ThreadPoolExecutor for independent modules (text encoder, transformer, VAE) to reduce launch time. Ensure CUDA context thread safety and monitor peak CPU memory usage.
Tech stack: Python, PyTorch
Domain: backend
Issue type: Feature
Difficulty: 3
Estimated time: Half day
Activity status: Active
Clarity: Clear
Prerequisites: Python, PyTorch, CUDA
Newbie friendliness: 40