sgl-project/sglang

[Feature] Support offload and wake up of SGLang Diffusion

Open

Opened on Feb 20, 2026

(5 comments) (0 reactions) (0 assignees) Python (28,442 stars) (6,216 forks)
good first issue

Description

Checklist

Motivation

In the LLM RL scenario, sleeping and waking up an SGLang server is widely used and well optimized for co-located placement, as detailed in Biao's (@hebiao064) blog: https://hebiao064.github.io/rl-memory-management

In LLM RL, we use torch_memory_saver to preserve the virtual addresses of the SGLang LLM server's memory so that CUDA Graph stays alive. Right now, SGLang Diffusion does not support CUDA Graph (@zyksir is working on it), so we may be able to use a more brute-force method to sleep and wake up. In the extreme case, we can even kill and relaunch the SGLang Diffusion server; the relaunch time is profiled in https://github.com/sgl-project/sglang/issues/19087

Given this, we may need a way to sleep and wake up SGLang Diffusion. Ideally, the API would be similar to https://docs.sglang.io/advanced_features/sglang_for_rl.html#fine-grained-engine-sleep-and-wake-up , but the starting point can be more brute force.
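To make the brute-force starting point concrete, here is a minimal sketch of a sleep/wake-up helper that simply moves components between devices. Everything in it is hypothetical: `BruteForceSleeper`, `sleep`, `wake_up`, and the component names are illustrative and are not existing SGLang Diffusion APIs. It only assumes each component exposes a `.to(device)` method, as `torch.nn.Module` does, so a small stand-in class is used to keep the example runnable without a GPU.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


class FakeComponent:
    """Stand-in for a model part (e.g. a torch.nn.Module); illustration only."""

    def __init__(self, device: str = "cuda") -> None:
        self.device = device

    def to(self, device: str) -> "FakeComponent":
        self.device = device
        return self


@dataclass
class BruteForceSleeper:
    """Hypothetical helper, NOT an SGLang API: offloads components to CPU and back."""

    components: Dict[str, Any]  # name -> object exposing .to(device)
    gpu_device: str = "cuda"
    asleep: bool = field(default=False, init=False)

    def _move_all(self, device: str) -> None:
        # Reassign the result so this works whether .to() returns the same
        # object (modules) or a new one (tensors).
        for name in list(self.components):
            self.components[name] = self.components[name].to(device)

    def sleep(self) -> None:
        """Offload every component to CPU; calling it twice is a no-op."""
        if self.asleep:
            return
        self._move_all("cpu")
        # With real PyTorch modules, torch.cuda.empty_cache() would be needed
        # here so the caching allocator hands the freed blocks back to the GPU.
        self.asleep = True

    def wake_up(self) -> None:
        """Move every component back to the GPU device."""
        if not self.asleep:
            return
        self._move_all(self.gpu_device)
        self.asleep = False


sleeper = BruteForceSleeper({"transformer": FakeComponent(), "vae": FakeComponent()})
sleeper.sleep()    # components now report device "cpu"
sleeper.wake_up()  # components are back on "cuda"
```

Unlike the torch_memory_saver approach, this does not preserve virtual addresses, so it only fits as long as CUDA Graph is not in use.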

If I were to handle this issue myself, I would break it down into the following steps:

  1. Try the brute-force way to sleep and wake up the SGLang Diffusion server (for example, offloading some crucial parts to CPU; I'm not sure which), and compare that with directly killing and relaunching. If brute force is the best, then we are so cooked. 🤣
  2. If sleeping and waking up do help, then design sleep and wake-up APIs, following what we did for LLMs: https://docs.sglang.io/advanced_features/sglang_for_rl.html#fine-grained-engine-sleep-and-wake-up . This API would be great.
  3. Once step 2 is done, please provide the end-to-end time of "sleep, wake up + refit" vs. "kill and relaunch". Hopefully this time we can get a further speedup.
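The comparisons in steps 1 and 3 could be measured with a small harness like the sketch below. It is only an illustration: `time_steps`, `compare_strategies`, and the placeholder callables are hypothetical names, and the `time.sleep` calls stand in for the real sleep, wake-up, refit, kill, and relaunch operations, which are not implemented here.

```python
import time
from typing import Callable, Dict, Sequence

Step = Callable[[], None]


def time_steps(steps: Sequence[Step]) -> float:
    """Run the steps in order and return the total wall-clock seconds."""
    start = time.perf_counter()
    for step in steps:
        step()
    # For real GPU work, call torch.cuda.synchronize() before reading the
    # clock, otherwise asynchronous kernels and copies are not counted.
    return time.perf_counter() - start


def compare_strategies(
    strategies: Dict[str, Sequence[Step]], repeats: int = 3
) -> Dict[str, float]:
    """Return the best (minimum) end-to-end time per strategy over `repeats` runs."""
    return {
        name: min(time_steps(steps) for _ in range(repeats))
        for name, steps in strategies.items()
    }


# Placeholder steps; replace with the real operations when profiling.
def fake_sleep() -> None:
    time.sleep(0.001)


def fake_wake_up() -> None:
    time.sleep(0.001)


def fake_refit() -> None:
    time.sleep(0.001)


def fake_kill_and_relaunch() -> None:
    time.sleep(0.05)


results = compare_strategies(
    {
        "sleep, wake up + refit": [fake_sleep, fake_wake_up, fake_refit],
        "kill and relaunch": [fake_kill_and_relaunch],
    }
)
for name, seconds in results.items():
    print(f"{name}: {seconds:.3f} s")
```

Taking the minimum over a few repeats reduces noise from one-off effects such as disk cache misses; reporting the mean or the full distribution would be equally reasonable.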

Related resources

No response
