Labels: enhancement, good first issue, help wanted, high priority
Description
Motivation
We have already implemented initial support for EAGLE speculative decoding with the overlap scheduler. This issue is the roadmap for further feature support and optimizations. The initial skeleton code is in this PR: https://github.com/sgl-project/sglang/pull/11398
The design illustration is here
[!NOTE] The arg `--enable-beta-spec` has been deprecated; please use `export SGLANG_ENABLE_SPEC_V2=1` to enable this feature.
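As a rough illustration of the note above, the following sketch prepares an environment and command line that enable the feature through the environment variable rather than the deprecated flag. The helper name `build_launch` and the model paths are hypothetical. The `--speculative-algorithm` and `--speculative-draft-model-path` flags are assumed to match the installed SGLang version and should be checked against its server arguments.

```python
import os


def build_launch(model_path, draft_model_path):
    """Return (env, argv) for a server launch with the spec v2 path enabled.

    Hypothetical helper for illustration; not part of SGLang.
    """
    env = dict(os.environ)
    # Replaces the deprecated --enable-beta-spec argument.
    env["SGLANG_ENABLE_SPEC_V2"] = "1"
    argv = [
        "python", "-m", "sglang.launch_server",
        "--model-path", model_path,
        # Assumed flags; verify against your SGLang version.
        "--speculative-algorithm", "EAGLE",
        "--speculative-draft-model-path", draft_model_path,
    ]
    return env, argv


env, argv = build_launch("path/to/target-model", "path/to/draft-model")
```

The pair can then be passed to `subprocess.run(argv, env=env, check=True)` to start the server.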
page size & topk support
- Support page size > 1 @cicirori @hnyls2002 #11772
- Support topk > 1 @vincentzed #11839
- Support topk > 1 + page size > 1 @vincentzed
memory allocation
- over-allocation optimization @hnyls2002
- over-allocation with page size > 1 + topk > 1
Attention backend support
- Remove or make `verify_done.synchronize()` an option @hnyls2002
- Different attention backend support @Fridge003 @Qiaolin-Yu
sampling
- https://github.com/sgl-project/sglang/issues/13019
- penalty support
- logprob support
speculative methods
- new speculative model worker interface (https://github.com/sgl-project/sglang/pull/11643)
- standalone speculative support @Qiaolin-Yu
- ngram speculative support @a4zhangfei
- Top `SpecTpWorker` for all speculative decoding backends @hnyls2002
- Make `SpecTpWorker` compatible with all `TpModelWorker` features.
- Specialize for the high-throughput case (num_step=1, topk=1, num_verify_forward_pass_tokens=2) @yukavio
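To make the high-throughput configuration above concrete, here is a minimal sketch of greedy chain verification, which is the generic textbook form of speculative decoding and not SGLang's actual implementation. It assumes one reading of the parameters: with one draft step and topk=1, the verify pass sees two tokens (the last accepted token and one draft token), so the target model produces two predictions. The function name `verify_chain` is hypothetical.

```python
def verify_chain(draft_tokens, target_tokens):
    """Greedy verification of a single draft chain (topk=1).

    draft_tokens:  k tokens proposed by the draft model.
    target_tokens: k + 1 greedy predictions from the target model;
                   target_tokens[i] is the token the target expects at
                   the position of draft_tokens[i], and the last entry
                   is the token that follows the full draft chain.
    Returns the matching draft prefix plus one token from the target.
    """
    accepted = []
    for i, draft in enumerate(draft_tokens):
        if draft != target_tokens[i]:
            break  # first mismatch ends acceptance
        accepted.append(draft)
    # The target's own prediction at the first unverified position is
    # always emitted, so every verify pass yields at least one token.
    accepted.append(target_tokens[len(accepted)])
    return accepted


# One draft step: the verify pass covers 2 tokens and yields 1 or 2 outputs.
hit = verify_chain([5], [5, 9])   # draft accepted, plus one extra token
miss = verify_chain([5], [7, 9])  # draft rejected, target's token is used
```

Under this reading, each verify forward pass emits either one or two tokens per request, which is why the case is small enough to be worth specializing.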
DP attention support
- Support idle batch @iforgetmyname
- #12443
- cover testcases with dp-attention + overlap + spec @iforgetmyname
EP support
- Check compatibility with DeepEP / EP @fzyzcjy
- Cover testcases with EP + overlap + spec @fzyzcjy
PD disaggregation
- Adjust the event loop in Prefill / Decode workers @shaharmor98
- Cover testcases with PD-Disagg + overlap + spec @ShangmingCai
LoRA Support
Aggressive Optimizations
- Enable a separate `plan_stream