sgl-project/sglang

[Feature] Reduce constrained-decoding overhead in TP

Open

Opened on Nov 23, 2025

good first issue

Description

Checklist

Motivation

Currently, when tensor parallelism (TP) and constrained decoding are both enabled, every TP rank compiles the same grammar. This incurs non-trivial CPU overhead: for TP size $n$, the total grammar-compilation cost is $n \times$ that of a single rank. In fact, we only need to compile the grammar and sample the result on the first rank.

A possible implementation can be:

  1. Apply the grammar token mask only on rank 0.
  2. Broadcast the next token ids from rank 0 when the batch contains any grammar.
  3. Keep the old code path when the batch contains no grammar (i.e. no extra all-reduce).
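
The three steps above could be sketched roughly as follows. This is a pure-Python illustration of the control flow only, not the sglang implementation: `sample_batch`, `greedy`, `simulate_tp`, and the injected `broadcast` callable are hypothetical names, greedy argmax stands in for the real sampler, and in a real TP setup the broadcast would be a collective such as `torch.distributed.broadcast` with `src=0`.

```python
import math
from typing import Callable, List, Optional, Sequence

Mask = Optional[Sequence[bool]]  # None means "this request has no grammar"


def greedy(row: Sequence[float]) -> int:
    """Stand-in sampler: index of the largest logit."""
    return max(range(len(row)), key=row.__getitem__)


def sample_batch(
    rank: int,
    logits: List[List[float]],
    batch_has_grammar: bool,
    masks: Optional[List[Mask]],  # only needed (and only built) on rank 0
    broadcast: Callable[[List[int]], List[int]],  # hypothetical: returns rank 0's list
) -> List[int]:
    # Step 3: no grammar in the batch, so keep the old path with no extra communication.
    if not batch_has_grammar:
        return [greedy(row) for row in logits]

    if rank == 0:
        # Step 1: only rank 0 applies the grammar token mask.
        tokens = []
        for row, mask in zip(logits, masks):
            if mask is not None:
                row = [x if ok else -math.inf for x, ok in zip(row, mask)]
            tokens.append(greedy(row))
    else:
        # Other ranks skip grammar work; this placeholder is overwritten below.
        tokens = [0] * len(logits)

    # Step 2: every rank adopts the token ids chosen on rank 0.
    return broadcast(tokens)


def simulate_tp(world_size, logits, batch_has_grammar, masks):
    """Run all ranks sequentially in one process with a fake broadcast."""
    box = {}

    def make_broadcast(rank):
        def broadcast(tokens):
            if rank == 0:
                box["tokens"] = list(tokens)
            return list(box["tokens"])
        return broadcast

    return [
        sample_batch(r, logits, batch_has_grammar,
                     masks if r == 0 else None, make_broadcast(r))
        for r in range(world_size)
    ]


logits = [[0.1, 0.9, 0.3], [0.5, 0.2, 0.4]]
masks = [[True, False, True], None]  # the grammar forbids token 1 for request 0
with_grammar = simulate_tp(4, logits, True, masks)
without_grammar = simulate_tp(4, logits, False, None)
```

In this sketch, non-zero ranks only need to know whether the batch contains a grammar, not the compiled grammar itself, which is where the CPU savings would come from.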

https://github.com/sgl-project/sglang/blob/5c2915494c83f076a11afb2c3382eeb8a41f1974/python/sglang/srt/layers/sampler.py#L205-L217

Related resources

No response
