[Feature] Reduce constrained-decoding overhead in TP · sgl-project/sglang#13809

(1 comment) (0 reactions) (1 assignee) Python (28,442 stars) (6,216 forks)

good first issue

Description

Checklist

If this is not a feature request but a general question, please start a discussion at https://github.com/sgl-project/sglang/discussions. Otherwise, it will be closed.
Please use English. Otherwise, it will be closed.

Motivation

Currently, when tensor parallelism (TP) and constrained decoding are both enabled, every TP rank compiles the same grammar. This incurs non-trivial CPU overhead (for TP $n$, the overhead is $n \times$). In fact, we only need to compile the grammar and sample the result on the first rank.

A possible implementation can be:

Apply the grammar token mask only on rank 0.
Broadcast the next token IDs from rank 0 when the batch contains grammars.
Keep the old code path when the batch contains no grammar (i.e., no extra all-reduce).
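
The three steps above can be sketched as follows. This is a minimal single-process simulation, not sglang code: `FakeGroup`, `sample_step`, `apply_mask`, and `greedy` are hypothetical names introduced for illustration, greedy argmax stands in for the real sampler, and the `has_grammar` flag is assumed to be known on every rank. In a real TP setup the broadcast would presumably be a collective such as `torch.distributed.broadcast` on a tensor of token IDs.

```python
import math
from typing import List, Optional, Set


class FakeGroup:
    """Hypothetical single-process stand-in for a TP process group."""

    def __init__(self) -> None:
        self._buf: List[int] = []
        self.num_broadcasts = 0  # counts collectives issued by rank 0

    def broadcast(self, value: Optional[List[int]], rank: int) -> List[int]:
        # Rank 0 publishes its token IDs; every rank returns the same copy.
        if rank == 0:
            self._buf = list(value)
            self.num_broadcasts += 1
        return list(self._buf)


def greedy(row: List[float]) -> int:
    """Argmax over one row of logits (stand-in for the real sampler)."""
    return max(range(len(row)), key=row.__getitem__)


def apply_mask(row: List[float], allowed: Set[int]) -> List[float]:
    """Set logits of tokens the grammar forbids to -inf."""
    return [x if i in allowed else -math.inf for i, x in enumerate(row)]


def sample_step(
    rank: int,
    logits: List[List[float]],
    has_grammar: bool,
    allowed: Optional[List[Optional[Set[int]]]],
    group: FakeGroup,
) -> List[int]:
    if not has_grammar:
        # Old path: every rank samples locally, no extra collective.
        return [greedy(row) for row in logits]
    if rank == 0:
        # Only rank 0 holds the grammar state and applies the token mask.
        masked = [
            row if a is None else apply_mask(row, a)
            for row, a in zip(logits, allowed)
        ]
        next_ids = [greedy(row) for row in masked]
    else:
        next_ids = None  # other ranks skip masking and sampling
    return group.broadcast(next_ids, rank)


logits = [[0.1, 2.0, 0.5], [1.5, 0.2, 0.3]]
allowed = [{0, 2}, None]  # request 0 is constrained, request 1 is not

group = FakeGroup()
with_grammar = [
    sample_step(r, logits, True, allowed if r == 0 else None, group)
    for r in range(4)  # rank 0 runs first in this simulation
]

plain_group = FakeGroup()
without_grammar = [
    sample_step(r, logits, False, None, plain_group) for r in range(4)
]
```

In this sketch only rank 0 ever receives the `allowed` sets, so the other ranks never need to compile or advance a grammar, and the collective is issued only when the batch actually contains one.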

https://github.com/sgl-project/sglang/blob/5c2915494c83f076a11afb2c3382eeb8a41f1974/python/sglang/srt/layers/sampler.py#L205-L217

Related resources

No response

Contributor guide

Tech stack: python
Domain: backend
Issue type: feature
Difficulty: 3
Estimated time: 1-2 days
Activity status: active
Clarity: clear
Prerequisites: Python, Git
Beginner-friendly: 35
Research direction: Implement the proposed approach: compile the grammar only on rank 0, broadcast the next token IDs when the batch contains grammars, and keep the current code path when it does not. Refer to the linked sampler.py code.