sgl-project/sglang

[Feature] Parallelize module loading to speed up server launch and refit for Diffusion models

Open

Opened on Feb 20, 2026

good first issue

Description

Checklist

Motivation

Based on the detailed breakdown in issue #19087, the current launch process for SGLang Diffusion (using Qwen-Image as an example) is heavily bottlenecked by module loading, which currently happens sequentially.

As shown in the profiling results, load_modules takes approximately 63.20s out of the total 77.09s launch time (~82%).

  • text_encoder: 32.96s (52% of loading time)
  • transformer: 29.76s (47% of loading time)
  • vae/tokenizer/scheduler: < 1s combined.

Since these modules (Text Encoder, Transformer, VAE) are independent during the instantiation and weight-loading phase, the current sequential execution is suboptimal. We should implement a parallel loading strategy for load_modules, using multi-threading (e.g., concurrent.futures.ThreadPoolExecutor) or one of the parallel weight-loading strategies SGLang already uses for LLMs, so that the text encoder and transformer are initialized and have their weights transferred concurrently. In theory, this could reduce the module loading time from ~63s to ~33s, since the total would then be bounded by the slower of the two major components (text_encoder at 32.96s).
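A minimal sketch of the idea, not SGLang's actual loader: load_component is a hypothetical stand-in that sleeps to mimic I/O-bound loading, and the durations are the profiled times scaled down 100x. Real weight loading may hold the GIL for part of the work, so the measured speedup could be smaller than this toy suggests.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def load_component(name: str, seconds: float) -> str:
    """Hypothetical loader: sleeps to mimic I/O-bound weight loading."""
    time.sleep(seconds)
    return f"{name}-loaded"


def load_modules_parallel(specs: dict[str, float]) -> dict[str, str]:
    """Start every loader at once and collect the results by component name."""
    with ThreadPoolExecutor(max_workers=len(specs)) as pool:
        futures = {
            name: pool.submit(load_component, name, secs)
            for name, secs in specs.items()
        }
        # result() blocks until done and re-raises any exception from the worker.
        return {name: fut.result() for name, fut in futures.items()}


# Profiled durations from the issue, scaled down 100x (assumed for the demo).
SPECS = {"text_encoder": 0.33, "transformer": 0.30, "vae": 0.01}

start = time.perf_counter()
modules = load_modules_parallel(SPECS)
elapsed = time.perf_counter() - start
print(f"parallel: {elapsed:.2f}s, sequential sum: {sum(SPECS.values()):.2f}s")
```

With independent loaders, the wall time approaches the slowest component rather than the sum, which is where the ~33s estimate comes from.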

Benefits:

  1. Faster Cold Start: Significantly reduces the time for launch_server.
  2. Optimized Refit/Wake-up: As discussed in #19090, if we support offloading and waking up, parallelizing the weight reload from CPU/Disk to GPU will make the "wake-up" latency much lower.

Implementation Considerations:

  1. Ensure thread safety for CUDA context operations during model.to(device).
  2. Monitor Peak CPU memory usage when loading multiple state dicts into RAM simultaneously before moving them to GPU.
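One possible way to keep peak host memory in check, sketched under assumptions: a semaphore caps how many components may hold a state dict in RAM at the same time. MAX_IN_RAM, staged_load, and the sleep that stands in for the read-and-copy step are all hypothetical; the counters only exist to show that the cap holds.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

MAX_IN_RAM = 2  # hypothetical cap on state dicts resident in host RAM at once
ram_slots = threading.BoundedSemaphore(MAX_IN_RAM)
counter_lock = threading.Lock()
in_flight = 0
peak_in_flight = 0


def staged_load(name: str) -> str:
    """Hypothetical loader that holds a RAM slot while its weights are staged."""
    global in_flight, peak_in_flight
    with ram_slots:  # blocks while MAX_IN_RAM loads are already staged
        with counter_lock:
            in_flight += 1
            peak_in_flight = max(peak_in_flight, in_flight)
        time.sleep(0.05)  # stand-in for reading weights and copying them to GPU
        with counter_lock:
            in_flight -= 1
    return name


names = ["text_encoder", "transformer", "vae", "tokenizer"]
with ThreadPoolExecutor(max_workers=len(names)) as pool:
    loaded = list(pool.map(staged_load, names))  # map preserves input order

print(f"loaded={loaded}, peak concurrent loads={peak_in_flight}")
```

The cap trades some parallelism for a bounded memory footprint; with only two large components, a cap of 2 would still let the text encoder and transformer overlap.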

Apart from this, it would also be helpful to check whether we are currently loading the weights through pageable memory.

https://github.com/sgl-project/sglang/issues/19087#issuecomment-3937649804

Spending 32s on loading text_encoder looks strange: if we were using pinned memory, the load speed could be much higher.

Related resources

No response
