sgl-project/sglang

[Feature] Reduce constrained-decoding overhead in TP

Open

#13,809 opened on Nov 23, 2025

good first issue

Description


Motivation

Currently, when tensor parallelism (TP) and constrained decoding are both enabled, every TP rank compiles the same grammar. This incurs non-trivial CPU overhead: for TP size $n$, the total grammar-compilation work is $n \times$ that of a single rank. In fact, we only need to compile the grammar and sample the result on the first rank.

A possible implementation can be:

  1. Only apply the grammar token mask on rank 0.
  2. Broadcast the next token IDs from rank 0 when there are grammars in the batch.
  3. Keep the old code path when there is no grammar in the batch (i.e. no extra all-reduce).

https://github.com/sgl-project/sglang/blob/5c2915494c83f076a11afb2c3382eeb8a41f1974/python/sglang/srt/layers/sampler.py#L205-L217
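
The three steps above can be sketched roughly as follows. This is a single-process illustration, not SGLang code: `sample_step`, `masked_argmax`, `compile_mask`, and `InProcessBroadcast` are hypothetical names, greedy argmax stands in for the real sampler, and the broadcast is modelled by a small in-process object. In an actual TP setup that role would be played by a collective such as `torch.distributed.broadcast` with `src=0`.

```python
from typing import Callable, List, Optional, Sequence


def masked_argmax(logits: Sequence[float], allowed: Sequence[bool]) -> int:
    """Greedy pick restricted to the tokens the grammar allows."""
    best: Optional[int] = None
    for tok, (score, ok) in enumerate(zip(logits, allowed)):
        if ok and (best is None or score > logits[best]):
            best = tok
    if best is None:
        raise ValueError("grammar mask rejects every token")
    return best


class InProcessBroadcast:
    """Hypothetical stand-in for a collective: remembers what rank 0 sent.

    Rank 0 must call it before the other ranks in this simulation.
    """

    def __init__(self) -> None:
        self._value: Optional[int] = None

    def __call__(self, value: Optional[int], rank: int) -> Optional[int]:
        if rank == 0:
            self._value = value
        return self._value


def sample_step(
    rank: int,
    logits: Sequence[float],
    compile_mask: Callable[[], List[bool]],
    has_grammar: bool,
    broadcast: Callable[[Optional[int], int], Optional[int]],
) -> Optional[int]:
    """One decoding step on one TP rank (assumed: logits identical on all ranks)."""
    if not has_grammar:
        # Step 3: old path, every rank samples locally, no extra communication.
        return max(range(len(logits)), key=logits.__getitem__)
    token: Optional[int] = None
    if rank == 0:
        # Step 1: only rank 0 pays for grammar compilation and masking.
        token = masked_argmax(logits, compile_mask())
    # Step 2: every rank receives rank 0's choice.
    return broadcast(token, rank)


# Tiny demo: 4 simulated ranks, grammar allows only tokens 1 and 2.
calls: List[int] = []


def compile_mask() -> List[bool]:
    calls.append(1)  # count how often the grammar work is done
    return [False, True, True, False]


logits = [5.0, 1.0, 3.0, 4.0]
bcast = InProcessBroadcast()
tokens = [sample_step(r, logits, compile_mask, True, bcast) for r in range(4)]
print(tokens, "grammar compiled", len(calls), "time(s)")
```

In this sketch the grammar work runs once instead of once per rank, and the extra communication only happens on steps where `has_grammar` is true. How a real implementation would detect grammars in the batch and which process group it would broadcast over are left open here.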

Related resources

No response
