[Feature] Overlap Spec Support · sgl-project/sglang#11762

Repository metrics

Stars: 28,442
PR merge metrics: average merge time 2d 1h; 1,000 merged PRs in 30d

Description

Motivation

We have implemented initial support for EAGLE speculative decoding with the overlap scheduler. This issue is the roadmap for further optimizations and feature support. The initial skeleton code is in PR https://github.com/sgl-project/sglang/pull/11398

The design illustration is here

[!NOTE] The argument `--enable-beta-spec` has been deprecated. Set `export SGLANG_ENABLE_SPEC_V2=1` to enable this feature instead.
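
For launching from a script rather than a shell, the note above can be sketched in Python as follows. Only the `SGLANG_ENABLE_SPEC_V2` variable comes from this issue; the helper names `spec_v2_env` and `launch_with_spec_v2` are hypothetical, and the server command itself is left to the caller.

```python
import os
import subprocess


def spec_v2_env(base_env=None):
    """Return a copy of the environment with the overlap spec flag set."""
    env = dict(os.environ if base_env is None else base_env)
    # Replacement for the deprecated --enable-beta-spec argument.
    env["SGLANG_ENABLE_SPEC_V2"] = "1"
    return env


def launch_with_spec_v2(server_cmd):
    """Start a caller-supplied server command with the flag enabled.

    server_cmd is a list of strings, exactly as passed to subprocess.Popen.
    """
    return subprocess.Popen(server_cmd, env=spec_v2_env())
```

Copying the environment instead of mutating `os.environ` keeps the flag scoped to the child process, which may be useful when the same script also starts servers without the feature.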

Page size & topk support

Support page size > 1 @cicirori @hnyls2002 #11772
Support topk > 1 @vincentzed #11839
Support topk > 1 + page size > 1 @vincentzed

Memory allocation

over-allocation optimization @hnyls2002
over-allocation with page size > 1 + topk > 1

Attention backend support

Remove `verify_done.synchronize()` or make it optional @hnyls2002
Different attention backend support @Fridge003 @Qiaolin-Yu
- https://github.com/sgl-project/sglang/pull/11821/
- https://github.com/sgl-project/sglang/pull/11874

Sampling

https://github.com/sgl-project/sglang/issues/13019
penalty support
logprob support

Speculative methods

new speculative model worker interface (https://github.com/sgl-project/sglang/pull/11643)
standalone speculative support @Qiaolin-Yu
ngram speculative support @a4zhangfei
Top-level SpecTpWorker for all speculative decoding backends @hnyls2002
Make SpecTpWorker compatible with all TpModelWorker features.
Specialize for the high-throughput case (num_step=1, topk=1, num_verify_forward_pass_tokens=2) @yukavio
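
As a rough illustration of the last item, a specialized path could be gated by a check like the one below. This is a minimal sketch only: `SpecSettings` and `is_high_throughput_case` are hypothetical names, the field names simply mirror the parameters written in this item, and none of this is taken from the SGLang code base.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SpecSettings:
    # Field names mirror the roadmap item; they are illustrative,
    # not SGLang's actual server argument names.
    num_step: int
    topk: int
    num_verify_forward_pass_tokens: int


def is_high_throughput_case(settings: SpecSettings) -> bool:
    """True when the settings match the case named in the roadmap item."""
    return (
        settings.num_step == 1
        and settings.topk == 1
        and settings.num_verify_forward_pass_tokens == 2
    )
```

A worker could branch on such a predicate once at startup, so that the general path stays untouched for every other configuration.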

DP attention support

Support idle batch @iforgetmyname
- #12443
Cover test cases with DP attention + overlap + spec @iforgetmyname

EP support

Check compatibility with DeepEP / EP @fzyzcjy
Cover test cases with EP + overlap + spec @fzyzcjy

PD disaggregation

Adjust the event loop in the Prefill / Decode workers @shaharmor98
Cover test cases with PD-Disagg + overlap + spec @ShangmingCai

LoRA Support

@lifuhuang
https://github.com/sgl-project/sglang/pull/12903

Aggressive Optimizations

Enable a separate `plan_stream

Contributor guide

Research direction: Examine the existing overlap spec implementation and understand the speculative decoding architecture in SGLang. Review the linked PRs and design document to get an overview of the components involved.
Tech stack: python
Domain: backend
Issue type: Feature
Difficulty: 4
Estimated time: Over 1 week
Activity status: Active
Clarity: Mostly clear
Prerequisites: Python
Newbie friendliness: 15