[Feature] Multi-Token Prediction (MTP) support for sgl-jax · sgl-project/sglang-jax#192


Labels: SpecDecode, enhancement, help wanted

Description

Motivation

Current autoregressive language models generate tokens sequentially, which creates inherent bottlenecks in inference throughput. While speculative decoding techniques like EAGLE improve performance through draft-verify mechanisms, they still rely on single-token predictions from the base model. Multi-Token Prediction addresses this limitation by enabling the model to directly predict multiple tokens, reducing the number of forward passes required for sequence generation.

Road Map

Eagle Worker main process @SiqiLi-Fighting
- https://github.com/sgl-project/sglang-jax/pull/378
- adapted tree mask to rpa v3 attention kernel
- add bigram key prefix cache to radix cache
Performance Optimization @SiqiLi-Fighting
- [Feature] Eagle Performance Optimazation #436
- eagle topk = 1, JIT functional optimization
- eagle topk > 1, build tree mask kernel at draft decode stage
- Compatibility with SchedulerOverlap
- non-greedy sampling kernel implementation @SiqiLi-Fighting
Support for more speculative algorithms (Call for Contribution)
- adaptation of n-gram algorithms such as PLD, Suffix Decoding, and LookAhead
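
For contributors picking up the n-gram item, the core idea behind PLD (prompt lookup decoding) can be sketched in a few lines of plain Python: find an earlier occurrence of the trailing n-gram in the token history and propose the tokens that followed it as the draft. This is only an illustrative sketch; `prompt_lookup_draft`, `max_ngram`, and `num_draft` are hypothetical names and are not part of sglang-jax.

```python
from typing import List, Sequence


def prompt_lookup_draft(
    tokens: Sequence[int],
    max_ngram: int = 3,
    num_draft: int = 4,
) -> List[int]:
    """Propose draft tokens by matching the trailing n-gram earlier in `tokens`.

    Hypothetical helper for illustration only; not an sglang-jax API.
    """
    n_tokens = len(tokens)
    # Try the longest n-gram first, since longer matches are more specific.
    for n in range(min(max_ngram, n_tokens - 1), 0, -1):
        suffix = list(tokens[n_tokens - n:])
        # Scan right to left so the most recent earlier occurrence wins.
        # The upper bound excludes the trailing n-gram itself.
        for start in range(n_tokens - n - 1, -1, -1):
            if list(tokens[start:start + n]) == suffix:
                follow = start + n
                # Copy up to `num_draft` tokens that followed the match.
                return list(tokens[follow:follow + num_draft])
    # No earlier occurrence: propose nothing and fall back to normal decoding.
    return []


history = [1, 2, 3, 4, 5, 1, 2, 3]
print(prompt_lookup_draft(history))  # [4, 5, 1, 2]
```

The proposed tokens would still have to be verified by the target model, as in any draft-verify scheme; an empty draft simply means a regular decode step.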

Contributor guide

Tech stack: Python
Domain: machine learning, AI, backend
Issue type: feature
Difficulty: 4
Estimated time: over 1 week
Activity status: active
Clarity: clear
Prerequisites: Python, JAX, Machine Learning
Beginner-friendly: 30
Research direction: Study the existing Eagle worker implementation in sglang-jax, understand the tree mask and the attention kernel, then design a multi-token prediction mechanism that can predict multiple tokens per forward pass.
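
As a starting point for understanding the tree mask, the general idea of tree attention can be sketched as follows: each draft token may attend only to itself and its ancestors in the draft tree. The sketch below uses plain Python and a hypothetical `parents` array layout; it is an assumption for illustration and does not reflect the actual mask format used by the sglang-jax kernels.

```python
from typing import List, Sequence


def build_tree_mask(parents: Sequence[int]) -> List[List[bool]]:
    """Return mask[i][j] == True when draft node i may attend to draft node j.

    `parents[i]` is the index of node i's parent, or -1 for a root.
    Hypothetical layout for illustration; assumes parents[i] < i.
    """
    n = len(parents)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        node = i
        # Walk up to the root, marking the node itself and every ancestor.
        while node != -1:
            mask[i][node] = True
            node = parents[node]
    return mask


# A small tree: node 0 is the root, nodes 1 and 2 are its children,
# and node 3 is a child of node 1.
for row in build_tree_mask([-1, 0, 0, 1]):
    print(row)
```

With a single chain (for example `parents = [-1, 0, 1]`, the shape produced when topk = 1), the mask reduces to the usual lower-triangular causal mask.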