[Feature] Enabling both TBO and shared experts fusion · sgl-project/sglang#24690


good first issue

Description

Checklist

I searched related issues but found no solution.
The bug persists in the latest version.
Issues without environment info and a minimal reproducible demo are hard to resolve and may receive no feedback.
If this is not a bug report but a general question, please start a discussion at https://github.com/sgl-project/sglang/discussions. Otherwise, it will be closed.
Please use English. Otherwise, it will be closed.

Describe the bug

When serving an MoE model with DeepEP, DP attention, Two-Batch-Overlap (TBO), and enforced Shared Experts Fusion, the SGLang server can hang or deadlock during a concurrent serving benchmark.

The problematic combination appears to be:

--enable-two-batch-overlap
--enforce-shared-experts-fusion

When both are enabled, the server hangs during bench_serve. If I remove --enforce-shared-experts-fusion while keeping Two-Batch-Overlap enabled, the benchmark completes successfully.

This may be related to synchronization between the TBO path and the Shared Experts Fusion path when CUDA graph execution is not active. In my configuration, DP attention is enabled, so CUDA graph capture is effectively disabled and this path runs in eager mode.
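To narrow down where the eager-mode path stalls, one option is a small watchdog built on Python's standard `faulthandler` module, which can dump every thread's Python stack when a block runs too long. This is a minimal sketch, not part of SGLang: the helper name `hang_watchdog` and the placement inside the server process are assumptions, and a hang inside a GPU kernel would only show the Python frame that is blocked on it.

```python
import contextlib
import faulthandler
import sys


@contextlib.contextmanager
def hang_watchdog(timeout_s, file=sys.stderr):
    """Dump all threads' Python stacks if the wrapped block exceeds timeout_s."""
    # faulthandler runs its own watchdog thread; repeat=True re-dumps every
    # timeout_s seconds so successive snapshots of a stuck process can be compared.
    faulthandler.dump_traceback_later(timeout_s, repeat=True, file=file)
    try:
        yield
    finally:
        # Disarm the timer once the block finishes (or raises).
        faulthandler.cancel_dump_traceback_later()


# Hypothetical usage: wrap the step suspected of hanging.
# `run_one_forward_step` is a placeholder, not an SGLang function.
#
# with hang_watchdog(60):
#     run_one_forward_step()
```

If the same frames appear in every dump on every rank, that points at the call site where the TBO and Shared Experts Fusion paths are waiting on each other.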

Reproduction

Launch server

The following command is a reduced reproduction.

export MODEL_ID="<MOE_MODEL_PATH>"
export HOST="127.0.0.1"
export PORT="30050"
export TP_SIZE="8"

python3 -m sglang.launch_server \
  --model-path "${MODEL_ID}" \
  --host "${HOST}" \
  --port "${PORT}" \
  --tp "${TP_SIZE}" \
  --ep "${TP_SIZE}" \
  --dp-size "${TP_SIZE}" \
  --enable-dp-attention \
  --moe-a2a-backend deepep \
  --deepep-mode auto \
  --enable-two-batch-overlap \
  --enforce-shared-experts-fusion \
  --trust-remote-code \
  --log-level debug
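
Once the server is up, the hang can be checked without a full benchmark run by firing a batch of concurrent requests with a client-side timeout. The following is a rough sketch using only the Python standard library. It assumes the server exposes a POST /generate endpoint that accepts `text` and `sampling_params` in the request body; the function names `post_generate` and `probe`, the prompt text, and the request counts are illustrative choices, not taken from the report.

```python
import json
import socket
import urllib.request
from concurrent.futures import ThreadPoolExecutor

BASE_URL = "http://127.0.0.1:30050"  # HOST and PORT from the launch command


def post_generate(prompt, timeout_s):
    """Send one request; the endpoint and payload shape are assumptions."""
    body = json.dumps({
        "text": prompt,
        "sampling_params": {"max_new_tokens": 32},
    }).encode()
    req = urllib.request.Request(
        BASE_URL + "/generate",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=timeout_s) as resp:
        return resp.status


def probe(send, n_requests=64, concurrency=16, timeout_s=120.0):
    """Call `send` n_requests times in parallel and count outcomes by kind."""
    def one(i):
        try:
            send(f"Request {i}: say hello.", timeout_s)
            return "ok"
        except Exception as exc:
            # urllib may wrap a socket timeout in URLError(reason=...).
            reason = getattr(exc, "reason", exc)
            if isinstance(reason, (TimeoutError, socket.timeout)):
                return "timeout"
            return f"error: {type(exc).__name__}"

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(one, range(n_requests)))
    return {kind: results.count(kind) for kind in sorted(set(results))}
```

Calling `probe(post_generate)` against the running server returns a dictionary of outcome counts. A result dominated by `timeout` with both flags enabled, and by `ok` after removing --enforce-shared-experts-fusion, would match the behaviour described above. Passing the sender as an argument keeps the probe testable with a stub.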

Environment

Python: 3.12.3
CUDA available: True
GPU: 8x NVIDIA B300 SXM6 AC or equivalent multi-GPU system
GPU Compute Capability: 10.3
CUDA_HOME: /usr/local/cuda
NVCC: CUDA 12.9
CUDA Driver Version: 580.126.16

PyTorch: 2.9.1+cu129
sglang: 0.0.0.dev11616+ga8769937d.d20260502
sglang-kernel: 0.4.2
flashinfer_python: 0.6.7.post3
flashinfer_cubin: 0.6.7.post3
flashinfer_jit_cache: 0.6.7.post3+cu129
triton: 3.5.1
transformers: 5.3.0
torchao: 0.9.0
numpy: 2.3.5
aiohttp: 3.13.5
fastapi: 0.135.3
huggingface_hub: 1.13.0
orjson: 3.11.8
packaging: 26.0
psutil: 7.2.2
pydantic: 2.12.5
pyzmq: 27.1.0
uvicorn: 0.44.0
uvloop: 0.22.1
xgrammar: 0.1.32
openai: 2.6.1
tiktoken: 0.12.0

Contributor guide

Tech stack: Python, PyTorch
Domain: backend, infrastructure
Issue type: bug
Difficulty: 4
Estimated time: over 1 week
Activity status: active
Clarity: clear
Prerequisites: Python, CUDA, PyTorch
Newbie friendliness: 10
Research direction: Investigate the deadlock between Two Batch Overlap and Shared Experts Fusion in eager mode. Trace the execution paths in the MoE kernel to identify where synchronization fails.