[Feature] Reduce constrained-decoding overhead in TP · sgl-project/sglang#13809


good first issue

Description

Checklist

If this is not a feature request but a general question, please start a discussion at https://github.com/sgl-project/sglang/discussions. Otherwise, it will be closed.
Please use English. Otherwise, it will be closed.

Motivation

Currently, when tensor parallelism (TP) and constrained decoding are both enabled, every TP rank compiles the same grammar. This incurs non-trivial CPU overhead: with TP size $n$, the grammar is compiled $n$ times, i.e. $n \times$ the necessary CPU work. In fact, we only need to compile the grammar and sample the result on the first rank.

A possible implementation can be:

Apply the grammar token mask only on rank 0.
Broadcast the next token ids from rank 0 when there are grammars in the batch.
Keep the old code path when there is no grammar in the batch (i.e., no extra all-reduce).
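
The three steps above can be sketched in plain Python. This is a minimal single-process illustration, not the sglang implementation: `sample_next_tokens`, `FakeBroadcast`, `uses_grammar`, and `allowed_tokens` are hypothetical names, sampling is simplified to greedy argmax, and the broadcast is a stand-in for a real collective such as `torch.distributed.broadcast`. It assumes every rank holds the same full logits and knows from request metadata which requests use a grammar, while only rank 0 holds the compiled grammar state.

```python
from typing import Callable, List, Optional, Set

NEG_INF = float("-inf")


def argmax(row: List[float]) -> int:
    """Index of the largest value (first one wins on ties)."""
    return max(range(len(row)), key=row.__getitem__)


class FakeBroadcast:
    """Single-process stand-in for a broadcast from rank 0 (hypothetical).

    Rank 0 passes the token ids; other ranks pass None and receive a copy.
    Rank 0 must call it first, which a real collective would enforce itself.
    """

    def __init__(self) -> None:
        self.value: Optional[List[int]] = None
        self.calls = 0

    def __call__(self, value: Optional[List[int]]) -> List[int]:
        self.calls += 1
        if value is not None:
            self.value = value
        assert self.value is not None, "rank 0 has not sent anything yet"
        return list(self.value)


def sample_next_tokens(
    rank: int,
    logits: List[List[float]],
    uses_grammar: List[bool],
    allowed_tokens: Optional[List[Optional[Set[int]]]],
    broadcast: Callable[[Optional[List[int]]], List[int]],
) -> List[int]:
    """Greedy sampling where only rank 0 applies the grammar mask.

    uses_grammar is known on every rank; allowed_tokens (the per-request
    grammar mask) is only needed on rank 0 and may be None elsewhere.
    """
    # No grammar in the batch: keep the old path, no extra communication.
    if not any(uses_grammar):
        return [argmax(row) for row in logits]

    token_ids: Optional[List[int]] = None
    if rank == 0:
        token_ids = []
        for row, flag, allowed in zip(logits, uses_grammar, allowed_tokens):
            if flag:
                # Mask out every token the grammar does not allow.
                row = [x if i in allowed else NEG_INF for i, x in enumerate(row)]
            token_ids.append(argmax(row))

    # Every rank adopts rank 0's choice, so non-zero ranks never touch the grammar.
    return broadcast(token_ids)
```

In this sketch the non-zero ranks skip both grammar compilation and masking, at the cost of one broadcast per step for batches that contain a grammar. Whether that trade-off pays off in practice depends on the real sampler path linked below.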

https://github.com/sgl-project/sglang/blob/5c2915494c83f076a11afb2c3382eeb8a41f1974/python/sglang/srt/layers/sampler.py#L205-L217

Related resources

No response

Contributor guide

Tech stack: python
Domain: backend
Issue type: feature
Difficulty: 3
Estimated time: 1-2 days
Activity status: active
Clarity: clear
Prerequisites: Python, Git
Newbie friendliness: 35
Research direction: Implement the proposed approach: compile the grammar only on rank 0, broadcast the next token ids when grammars are present, and keep the current code path when no grammars are present. Refer to the linked sampler.py code.