[Feature] Enabling both TBO and shared experts fusion
#24,690 opened on May 8, 2026
Description
Checklist
- I searched related issues but found no solution.
- The bug persists in the latest version.
- Issues without environment info and a minimal reproducible demo are hard to resolve and may receive no feedback.
- If this is not a bug report but a general question, please start a discussion at https://github.com/sgl-project/sglang/discussions. Otherwise, it will be closed.
- Please use English. Otherwise, it will be closed.
Describe the bug
When serving an MoE model with DeepEP, DP attention, Two-Batch-Overlap (TBO), and enforced Shared Experts Fusion, the SGLang server can hang (apparent deadlock) during a concurrent serving benchmark.
The problematic combination appears to be:
--enable-two-batch-overlap
--enforce-shared-experts-fusion
When both are enabled, the server hangs during bench_serve. If I remove --enforce-shared-experts-fusion while keeping Two-Batch-Overlap enabled, the benchmark completes successfully.
This may be related to synchronization between the TBO path and the Shared Experts Fusion path when CUDA graph execution is not active. In my configuration, DP attention is enabled, so CUDA graph capture is effectively disabled and this path runs in eager mode.
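Because a hang produces no error message, it can help to bound the benchmark run with a deadline so that "hung" and "failed" are reported differently. The following is a minimal sketch, assuming GNU coreutils `timeout` is available; `run_with_deadline` is a hypothetical helper name and is not part of SGLang.

```shell
# run_with_deadline is a hypothetical helper, not part of SGLang.
# Usage: run_with_deadline <seconds> <command> [args...]
# Prints OK, HANG, or FAIL(<status>) on stdout; the command's stdout is sent to stderr.
run_with_deadline() {
  deadline="$1"
  shift
  # GNU coreutils `timeout` exits with status 124 when the deadline expires.
  timeout "$deadline" "$@" >&2 && status=0 || status=$?
  case "$status" in
    0)   echo "OK" ;;
    124) echo "HANG" ;;
    *)   echo "FAIL($status)" ;;
  esac
}
```

The helper can wrap whatever benchmark client invocation is used (not shown in this report), for example `run_with_deadline 600 <benchmark command>`. The 600-second deadline is an arbitrary illustration and should exceed the expected benchmark duration.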
Reproduction
Launch server
The following command is a reduced reproduction.
export MODEL_ID="<MOE_MODEL_PATH>"
export HOST="127.0.0.1"
export PORT="30050"
export TP_SIZE="8"

python3 -m sglang.launch_server \
  --model-path "${MODEL_ID}" \
  --host "${HOST}" \
  --port "${PORT}" \
  --tp "${TP_SIZE}" \
  --ep "${TP_SIZE}" \
  --dp-size "${TP_SIZE}" \
  --enable-dp-attention \
  --moe-a2a-backend deepep \
  --deepep-mode auto \
  --enable-two-batch-overlap \
  --enforce-shared-experts-fusion \
  --trust-remote-code \
  --log-level debug
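The comparison described above (both flags enabled versus TBO only) can be scripted so that the two runs differ in exactly one flag. This is a sketch under stated assumptions: `variant_flags` and `COMMON_FLAGS` are hypothetical names introduced for illustration, and every flag is copied from the launch command above.

```shell
# Flags shared by both runs, copied from the launch command above.
COMMON_FLAGS="--enable-dp-attention --moe-a2a-backend deepep --deepep-mode auto --enable-two-batch-overlap"

# variant_flags is a hypothetical helper, not part of SGLang.
# It prints the feature flags for one of the two configurations being compared.
variant_flags() {
  case "$1" in
    both)     echo "$COMMON_FLAGS --enforce-shared-experts-fusion" ;;  # reported to hang
    tbo-only) echo "$COMMON_FLAGS" ;;                                   # reported to work
    *)        echo "unknown variant: $1" >&2; return 2 ;;
  esac
}

# Example (assumes MODEL_ID, HOST, PORT, and TP_SIZE are exported as above):
# python3 -m sglang.launch_server --model-path "${MODEL_ID}" --host "${HOST}" --port "${PORT}" \
#   --tp "${TP_SIZE}" --ep "${TP_SIZE}" --dp-size "${TP_SIZE}" \
#   $(variant_flags tbo-only) --trust-remote-code --log-level debug
```

Keeping the shared flags in one place makes it harder for the two runs to drift apart in any setting other than --enforce-shared-experts-fusion.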
Environment
Python: 3.12.3
CUDA available: True
GPU: 8x NVIDIA B300 SXM6 AC or equivalent multi-GPU system
GPU Compute Capability: 10.3
CUDA_HOME: /usr/local/cuda
NVCC: CUDA 12.9
CUDA Driver Version: 580.126.16
PyTorch: 2.9.1+cu129
sglang: 0.0.0.dev11616+ga8769937d.d20260502
sglang-kernel: 0.4.2
flashinfer_python: 0.6.7.post3
flashinfer_cubin: 0.6.7.post3
flashinfer_jit_cache: 0.6.7.post3+cu129
triton: 3.5.1
transformers: 5.3.0
torchao: 0.9.0
numpy: 2.3.5
aiohttp: 3.13.5
fastapi: 0.135.3
huggingface_hub: 1.13.0
orjson: 3.11.8
packaging: 26.0
psutil: 7.2.2
pydantic: 2.12.5
pyzmq: 27.1.0
uvicorn: 0.44.0
uvloop: 0.22.1
xgrammar: 0.1.32
openai: 2.6.1
tiktoken: 0.12.0