[Feature] Parallelize module loading to speed up server launch and refit for Diffusion models · sgl-project/sglang#19092


good first issue

Description

Checklist

If this is not a feature request but a general question, please start a discussion at https://github.com/sgl-project/sglang/discussions. Otherwise, it will be closed.
Please use English. Otherwise, it will be closed.

Motivation

Based on the detailed breakdown in issue #19087, the launch process for SGLang Diffusion (using Qwen-Image as an example) is heavily bottlenecked by module loading, which currently happens sequentially.

As shown in the profiling results, load_modules takes approximately 63.20s out of the total 77.09s launch time (~82%).

text_encoder: 32.96s (52% of loading time)
transformer: 29.76s (47% of loading time)
vae/tokenizer/scheduler: < 1s combined.

Since these modules (text encoder, transformer, VAE) are independent during instantiation and weight loading, loading them sequentially is suboptimal. We should implement a parallel loading strategy for load_modules. Using multi-threading (e.g., concurrent.futures.ThreadPoolExecutor, or one of the parallel weight-loading strategies already used for LLMs in SGLang) to initialize the text encoder and transformer and transfer their weights simultaneously could, in theory, reduce module loading time from ~63s to ~33s (the maximum of the two major components).
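A minimal sketch of the idea, using only the standard library. The loader here is a hypothetical stand-in that sleeps instead of reading weights, and load_module, load_modules_parallel, and SPECS are illustrative names, not SGLang APIs. The sleep times are the profiled timings scaled down by 100x.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def load_module(name: str, seconds: float) -> str:
    """Hypothetical stand-in for one component loader (sleeps instead of loading)."""
    time.sleep(seconds)
    return f"{name}-loaded"


def load_modules_parallel(specs: dict[str, float]) -> dict[str, str]:
    """Submit every loader at once, then collect results by component name."""
    with ThreadPoolExecutor(max_workers=len(specs)) as pool:
        futures = {
            name: pool.submit(load_module, name, seconds)
            for name, seconds in specs.items()
        }
        # result() re-raises any exception raised inside a loader thread.
        return {name: future.result() for name, future in futures.items()}


# Profiled timings from the issue, scaled down by 100x for a quick run.
SPECS = {"text_encoder": 0.33, "transformer": 0.30, "vae": 0.01}

start = time.perf_counter()
modules = load_modules_parallel(SPECS)
elapsed = time.perf_counter() - start

print(f"parallel: {elapsed:.2f}s, sequential would be ~{sum(SPECS.values()):.2f}s")
```

With sleep-based loaders the wall time approaches the slowest component rather than the sum. Real loaders may see a smaller gain if they contend for disk bandwidth, PCIe bandwidth, or the GIL, so the ~33s figure should be treated as a lower bound to verify by profiling.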

Benefits:

Faster Cold Start: Significantly reduces the time for launch_server.
Optimized Refit/Wake-up: As discussed in #19090, if we support offloading and waking up, parallelizing the weight reload from CPU/Disk to GPU will make the "wake-up" latency much lower.

Implementation Considerations:

Ensure thread safety for CUDA context operations during model.to(device).
Monitor peak CPU memory usage when multiple state dicts are loaded into RAM simultaneously before being moved to the GPU.
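One way to watch the second point is to read the process-wide peak resident set size before and after loading. The sketch below assumes a Unix host (the resource module is not available on Windows), and peak_rss_mib is a hypothetical helper name. The bytearray is only a placeholder for a state dict staged in host RAM.

```python
import resource
import sys


def peak_rss_mib() -> float:
    """Peak resident set size of this process so far, in MiB (Unix only)."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


peak_before = peak_rss_mib()

# Placeholder for weights staged in host memory before the GPU transfer.
staging = bytearray(64 * 1024 * 1024)

peak_after = peak_rss_mib()
print(f"peak RSS: {peak_before:.0f} MiB -> {peak_after:.0f} MiB")
```

Because the value is a high-water mark for the whole process, it covers all loader threads at once. Comparing the peak of a sequential run against a parallel run would show how much extra host RAM the parallel strategy needs.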

Separately, it would also be helpful to verify whether the weights are currently being loaded through pageable memory.

https://github.com/sgl-project/sglang/issues/19087#issuecomment-3937649804

Spending 32s on loading text_encoder looks strange: if we were using pinned memory, the load speed could be much higher.
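A quick check along these lines could list which staged tensors are not pinned. The sketch is duck-typed so it runs without PyTorch installed: find_pageable and FakeTensor are hypothetical names, and the only method relied on is is_pinned(), which torch.Tensor does provide.

```python
from typing import Iterable, Protocol


class SupportsIsPinned(Protocol):
    def is_pinned(self) -> bool: ...


def find_pageable(named_tensors: Iterable[tuple[str, SupportsIsPinned]]) -> list[str]:
    """Return the names of tensors that are not in pinned (page-locked) memory."""
    return [name for name, tensor in named_tensors if not tensor.is_pinned()]


class FakeTensor:
    """Stand-in for a CPU tensor so the example runs without PyTorch."""

    def __init__(self, pinned: bool) -> None:
        self._pinned = pinned

    def is_pinned(self) -> bool:
        return self._pinned


report = find_pageable(
    [("encoder.weight", FakeTensor(False)), ("encoder.bias", FakeTensor(True))]
)
print(report)
```

With real weights, the input would be the items of a CPU-side state dict, checked before the transfer to the GPU. Tensors that show up as pageable could then be pinned with pin_memory() and copied with to(device, non_blocking=True), and the load time compared against the current 32s.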

Related resources

No response

Contributor guide

Tech stack: Python, PyTorch
Domain: backend
Issue type: feature
Difficulty: 3
Estimated time: half day
Activity status: active
Clarity: clear
Prerequisites: Python, PyTorch, CUDA
Beginner-friendly: 40
Research direction: Investigate parallelizing the load_modules function using concurrent.futures.ThreadPoolExecutor for the independent modules (text encoder, transformer, VAE) to reduce startup time. Ensure CUDA context safety and monitor peak CPU memory usage.