[Feature] Support offload and wake up of SGLang Diffusion · sgl-project/sglang#19090

(5 comments) (0 reactions) (0 assignees) Python (6216 forks)

good first issue

Repository metrics

Stars: (28,442 stars)
PR merge metrics: (average merge time 2d 1h) (1000 PRs merged in 30 days)

Description

Checklist

If this is not a feature request but a general question, please start a discussion at https://github.com/sgl-project/sglang/discussions. Otherwise, it will be closed.
Please use English. Otherwise, it will be closed.

Motivation

In the LLM RL scenario, sleeping and waking up an SGLang server is widely used and optimized in co-located placement, as detailed in Biao's (@hebiao064) blog: https://hebiao064.github.io/rl-memory-management

In LLM RL, we use torch_memory_saver to preserve the virtual addresses of the SGLang LLM server so that CUDA Graph stays alive. SGLang Diffusion does not support CUDA Graph yet (@zyksir is working on it), so we may be able to use a more brute-force method to sleep and wake up. In extreme situations, we can even kill and relaunch the SGLang Diffusion server; the relaunch time is profiled in https://github.com/sgl-project/sglang/issues/19087

Given this, we may need a way to sleep and wake up SGLang Diffusion. The ideal API should be similar to https://docs.sglang.io/advanced_features/sglang_for_rl.html#fine-grained-engine-sleep-and-wake-up , but the starting point can be more brute force.

If I were handling this issue myself, I would break it down into the following steps:

1. Try out the brute-force way to sleep and wake up the SGLang Diffusion server (for example, offloading some crucial parts to CPU; I don't know exactly which), and compare that with directly killing and relaunching. If brute force is the best, then we are so cooked. 🤣
2. If sleeping and waking up do help, then try to design sleep and wake-up APIs, following what we did for LLMs: https://docs.sglang.io/advanced_features/sglang_for_rl.html#fine-grained-engine-sleep-and-wake-up . This API would be great.
3. Once step 2 is done, please provide the end-to-end time of "sleep, wake up + refit" vs. "kill and relaunch". Hopefully this time we can get a further speedup.
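The comparison asked for in the steps above could be timed with a small harness like the one below. This is only a sketch: `time_cycle`, `compare`, and the four placeholder functions (`sleep_engine`, `wake_engine`, `refit_weights`, `kill_and_relaunch`) are hypothetical names that are not part of SGLang, and the `time.sleep` calls stand in for the real server operations, which would have to be plugged in.

```python
import statistics
import time
from typing import Callable, Dict, List


def time_cycle(steps: List[Callable[[], None]], repeats: int = 3) -> float:
    """Return the median wall-clock seconds needed to run all steps in order."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        for step in steps:
            step()
        # For real GPU work, synchronize the device before reading the clock
        # (for example with torch.cuda.synchronize()) so async work is counted.
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def compare(
    strategies: Dict[str, List[Callable[[], None]]], repeats: int = 3
) -> Dict[str, float]:
    """Time every named strategy and return a name -> median seconds mapping."""
    return {name: time_cycle(steps, repeats) for name, steps in strategies.items()}


# Hypothetical placeholders: replace the bodies with the real operations.
def sleep_engine() -> None:
    time.sleep(0.01)  # stand-in for offloading the diffusion server


def wake_engine() -> None:
    time.sleep(0.01)  # stand-in for restoring the diffusion server


def refit_weights() -> None:
    time.sleep(0.01)  # stand-in for loading updated weights after wake-up


def kill_and_relaunch() -> None:
    time.sleep(0.03)  # stand-in for killing and relaunching the server


if __name__ == "__main__":
    results = compare(
        {
            "sleep, wake up + refit": [sleep_engine, wake_engine, refit_weights],
            "kill and relaunch": [kill_and_relaunch],
        }
    )
    for name, seconds in sorted(results.items(), key=lambda item: item[1]):
        print(f"{name}: {seconds:.3f}s")
```

The median is used rather than the mean so that a single slow run (for example, a cold first launch) does not dominate the reported number; reporting the first-run time separately may also be worthwhile.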

Related resources

No response

Contributor guide

Research direction: Investigate methods for offloading the SGLang Diffusion engine to the CPU and restoring it, compare them against kill/relaunch with benchmarks, and implement sleep/wake-up APIs similar to those of SGLang LLM.
Tech stack: python
Domain: backend, infrastructure
Issue type: Feature
Difficulty: 3
Estimated time: 1-2 days
Task status: Active
Clarity: Clear
Prerequisites: Python, SGLang, CUDA Memory Management
Beginner-friendliness: 35