sgl-project/sglang

[Feature] Overlap Spec Support

Open

#11762 opened on Oct 17, 2025

Labels: enhancement, good first issue, help wanted, high priority

Description

Motivation

Initial support for EAGLE speculative decoding with the overlap scheduler is already implemented. This issue is the roadmap for further optimizations and feature support. The initial skeleton code is in this PR: https://github.com/sgl-project/sglang/pull/11398

The design illustration is here

[!NOTE] The `--enable-beta-spec` argument has been deprecated. To enable this feature, use `export SGLANG_ENABLE_SPEC_V2=1` instead.
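
For anyone launching from a script rather than a shell, here is a minimal sketch of setting the variable programmatically. The helper name `build_launch` and the model paths are hypothetical placeholders, and the EAGLE flags shown are assumed to match the installed SGLang version, so check `python -m sglang.launch_server --help` before relying on them.

```python
import os
import shlex
from typing import Dict, List, Tuple


def build_launch(target_model: str, draft_model: str) -> Tuple[Dict[str, str], List[str]]:
    """Return (env, cmd) for a server launch with overlap spec enabled.

    Hypothetical helper for illustration; not part of SGLang.
    """
    env = dict(os.environ)
    # Replaces the deprecated --enable-beta-spec argument.
    env["SGLANG_ENABLE_SPEC_V2"] = "1"
    cmd = [
        "python", "-m", "sglang.launch_server",
        "--model-path", target_model,
        # Assumed flag names; verify against your SGLang version.
        "--speculative-algorithm", "EAGLE",
        "--speculative-draft-model-path", draft_model,
    ]
    return env, cmd


if __name__ == "__main__":
    env, cmd = build_launch("<target-model>", "<draft-model>")
    print("SGLANG_ENABLE_SPEC_V2=" + env["SGLANG_ENABLE_SPEC_V2"], shlex.join(cmd))
    # To actually start the server:
    # import subprocess; subprocess.run(cmd, env=env, check=True)
```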


page size & topk support

  • Support page size > 1 @cicirori @hnyls2002 #11772
  • Support topk > 1 @vincentzed #11839
  • Support topk > 1 + page size > 1 @vincentzed

memory allocation

  • over-allocation optimization @hnyls2002
  • over-allocation with page size > 1 + topk > 1

Attention backend support

sampling

speculative methods

  • new speculative model worker interface (https://github.com/sgl-project/sglang/pull/11643)
  • standalone speculative support @Qiaolin-Yu
  • ngram speculative support @a4zhangfei
  • Top SpecTpWorker for all speculative decoding backends @hnyls2002
  • Make SpecTpWorker compatible with all TpModelWorker features.
  • Specialize for the high-throughput case (num_step=1, topk=1, num_verify_forward_pass_tokens=2) @yukavio
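
To make the high-throughput configuration above concrete, here is a toy sketch of one greedy speculative step with a single draft step and topk=1. It reflects my reading of the parameters, not the SGLang implementation: the verify pass sees two tokens (the last accepted token and the one draft token), so each step emits either one or two tokens. The names `spec_step`, `toy_target`, and `toy_draft` are hypothetical, and the two `target` calls stand in for what a real engine would do in one batched forward pass.

```python
from typing import Callable, List

# Hypothetical interface: a greedy "model" maps a token context to its next token.
NextToken = Callable[[List[int]], int]


def spec_step(prefix: List[int], draft: NextToken, target: NextToken) -> List[int]:
    """One speculative step with one draft step and topk=1 (greedy).

    `prefix` ends with the last accepted token. Returns the newly emitted tokens.
    """
    d = draft(prefix)  # the single draft token
    # Verify position 1: what the target predicts after the prefix.
    t0 = target(prefix)
    if t0 != d:
        # Draft rejected: emit only the target's own token.
        return [t0]
    # Verify position 2: the target's prediction after the accepted draft token.
    t1 = target(prefix + [d])
    return [d, t1]


def toy_target(ctx: List[int]) -> int:
    # Toy target model: counts upward modulo 10.
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[int]) -> int:
    # Toy draft model: agrees with the target except after multiples of 3.
    return 0 if ctx[-1] % 3 == 0 else (ctx[-1] + 1) % 10


if __name__ == "__main__":
    print(spec_step([1], toy_draft, toy_target))  # draft accepted, two tokens emitted
    print(spec_step([3], toy_draft, toy_target))  # draft rejected, one token emitted
```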

DP attention support

  • Support idle batch @iforgetmyname
    • #12443
  • cover testcases with dp-attention + overlap + spec @iforgetmyname

EP support

  • Check compatibility with DeepEP / EP @fzyzcjy
  • Cover testcases with EP + overlap + spec @fzyzcjy

PD disaggregation

  • Adjust the event loop in the Prefill / Decode workers @shaharmor98
  • Cover testcases with PD-Disagg + overlap + spec @ShangmingCai

LoRA Support

Aggressive Optimizations

  • Enable a separate `plan_stream
