good first issue
Description
Checklist
- If this is not a feature request but a general question, please start a discussion at https://github.com/sgl-project/sglang/discussions. Otherwise, it will be closed.
- Please use English. Otherwise, it will be closed.
Motivation
Currently, when tensor parallelism (TP) and constrained decoding are both enabled, every TP rank compiles the same grammar independently. This incurs non-trivial CPU overhead: for TP size $n$, the total compilation cost is $n \times$ that of a single rank. In fact, we only need to compile the grammar and sample the result on the first rank.
A possible implementation can be:
- Only apply the grammar token mask on rank 0.
- Broadcast the next token IDs from rank 0 when there are grammars in the batch.
- Keep the old code path when there is no grammar in the batch (i.e. no extra all-reduce).
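The steps above can be sketched roughly as follows. This is a single-process simulation in plain Python, not sglang code: `sample_next_tokens`, `FakeBroadcast`, and `simulate` are hypothetical names, greedy sampling stands in for the real sampler, and the `broadcast` callable stands in for a real collective such as `torch.distributed.broadcast`. It also assumes every rank already knows from request metadata whether the batch contains a grammar, so only rank 0 needs the compiled masks.

```python
from typing import Callable, List, Optional, Tuple

NEG_INF = float("-inf")
Mask = Optional[List[bool]]  # None means the request has no grammar


def greedy_sample(logits: List[float]) -> int:
    """Return the index of the largest logit (stand-in for the real sampler)."""
    return max(range(len(logits)), key=lambda i: logits[i])


def apply_mask(logits: List[float], allowed: List[bool]) -> List[float]:
    """Set logits of tokens the grammar forbids to -inf."""
    return [x if ok else NEG_INF for x, ok in zip(logits, allowed)]


def sample_next_tokens(
    rank: int,
    batch_logits: List[List[float]],
    has_grammar: bool,
    masks: Optional[List[Mask]],  # only provided on rank 0
    broadcast: Callable[[int, Optional[List[int]]], List[int]],
) -> List[int]:
    if not has_grammar:
        # Old code path: every rank samples locally, no extra communication.
        return [greedy_sample(logits) for logits in batch_logits]

    token_ids: Optional[List[int]] = None
    if rank == 0:
        # Only rank 0 holds compiled grammars and applies the token masks.
        token_ids = [
            greedy_sample(logits if mask is None else apply_mask(logits, mask))
            for logits, mask in zip(batch_logits, masks)
        ]
    # Every rank receives rank 0's token IDs.
    return broadcast(rank, token_ids)


class FakeBroadcast:
    """Hypothetical in-process stand-in for a broadcast collective."""

    def __init__(self) -> None:
        self.value: Optional[List[int]] = None
        self.calls = 0

    def __call__(self, rank: int, value: Optional[List[int]]) -> List[int]:
        self.calls += 1
        if rank == 0:
            self.value = value
        return list(self.value)


def simulate(
    tp_size: int,
    batch_logits: List[List[float]],
    has_grammar: bool,
    masks: Optional[List[Mask]],
) -> Tuple[List[List[int]], int]:
    """Run all ranks sequentially (rank 0 first); return results and broadcast count."""
    channel = FakeBroadcast()
    results = []
    for rank in range(tp_size):
        rank_masks = masks if rank == 0 else None  # other ranks compile nothing
        results.append(
            sample_next_tokens(rank, batch_logits, has_grammar, rank_masks, channel)
        )
    return results, channel.calls
```

In this sketch the non-zero ranks never touch a mask, so grammar compilation would happen once instead of once per rank, and batches without grammars incur no communication at all.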
Related resources
No response