[Roadmap] sglang auto tuner · sgl-project/sglang#13363

Repository metrics

Stars: 28,442
PR merge metrics: average merge time 2d 1h; 1,000 merged PRs in 30 days

Description

We now have many kernel backends for MoE (e.g., Triton, CUTLASS), attention, and allreduce. For each kernel, we can also tune some configs (e.g., the tile sizes in the Triton fused MoE kernel). Tuning these kernels and choosing dispatching heuristics can be non-trivial. We would like to build a simple command that auto-tunes all the kernels and dispatch heuristics for a model.
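As a rough illustration of the tuning loop described above, the sketch below grid-searches a small configuration space and keeps the fastest candidate. It is not sglang code: `config_grid`, `time_once`, and `autotune` are hypothetical helpers, the config keys are only illustrative tile-size names, and a real tuner would time GPU kernels with warm-up and device synchronization rather than wall-clock Python calls.

```python
import itertools
import time
from typing import Any, Callable, Dict, List, Sequence

Config = Dict[str, Any]


def config_grid(space: Dict[str, Sequence[Any]]) -> List[Config]:
    """Expand a {name: candidates} search space into a list of configs."""
    keys = list(space)
    return [
        dict(zip(keys, values))
        for values in itertools.product(*(space[k] for k in keys))
    ]


def time_once(kernel: Callable[[Config], None], config: Config, repeats: int = 3) -> float:
    """Return the best wall-clock time in seconds over a few runs (hypothetical helper)."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        kernel(config)
        best = min(best, time.perf_counter() - start)
    return best


def autotune(
    kernel: Callable[[Config], None],
    space: Dict[str, Sequence[Any]],
    measure: Callable[[Callable[[Config], None], Config], float] = time_once,
) -> Config:
    """Pick the config with the lowest measured cost (exhaustive grid search)."""
    return min(config_grid(space), key=lambda cfg: measure(kernel, cfg))
```

Passing `measure` as a parameter keeps the search separate from the timing method, so the same loop could be reused for different kernels or for choosing between backends.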

Todo

Implement a script, sglang.auto_tune, that tunes the kernels and hyperparameters for a specific model. It should dump the optimal tile sizes and dispatching heuristics for all kernels used by that model. Example usage:
- python3 -m sglang.auto_tune --model-path Qwen/Qwen3-30B-A3B-Instruct-2507 --tp 8
- python3 -m sglang.auto_tune --model-path Qwen/Qwen3-30B-A3B-Instruct-2507 --tp 4
Start by tuning the Triton fused MoE kernel: https://github.com/sgl-project/sglang/tree/main/benchmark/kernels/fused_moe_triton
Implement a GitHub Actions workflow that tunes 20 popular models on 5 common platforms.
- The workflow should take two arguments: a list of model names and a list of runner names.
- Our CI has H100, H20, H200, B200, and GB200.
Automatically choose the allreduce algorithm (custom allreduce, NCCL, NCCL symmetric memory, or torch symmetric memory).
Automatically choose the attention kernel backend and the MoE runner backend.
Auto-tune the CUTLASS GEMM kernels (with the CUTLASS profiler).
Automatically choose speculative decoding parameters for different batch sizes.
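One possible shape for the workflow's two arguments is sketched below: it crosses a list of model names with a list of runner names and emits the job list in the `include` form that a GitHub Actions `strategy.matrix` accepts. `build_matrix` and `tune_command` are hypothetical helpers, the runner labels are placeholders taken from the platform names above, and `sglang.auto_tune` is the script proposed in this roadmap, not an existing command.

```python
import json
from typing import Dict, List


def build_matrix(models: List[str], runners: List[str]) -> Dict[str, List[Dict[str, str]]]:
    """Cross every model with every runner: one tuning job per pair."""
    return {"include": [{"model": m, "runner": r} for m in models for r in runners]}


def tune_command(model: str, tp: int) -> str:
    """Format the proposed CLI call (sglang.auto_tune does not exist yet)."""
    return f"python3 -m sglang.auto_tune --model-path {model} --tp {tp}"


if __name__ == "__main__":
    # Placeholder runner labels; real CI runner names may differ.
    matrix = build_matrix(["Qwen/Qwen3-30B-A3B-Instruct-2507"], ["H100", "B200"])
    print(json.dumps(matrix))
```

A workflow could load the printed JSON into its matrix with `fromJSON`, so the model and runner lists stay plain inputs and adding a platform needs no change to the job definition.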

Contributor guide

Research direction: Study the fused moe triton benchmark and implement the auto tune script for that kernel as a starting point.
Tech stack: Python
Domain: backend, performance
Issue type: Feature
Difficulty: 5
Estimated time: Over 1 week
Activity status: Active
Clarity: Clear
Prerequisites: Python, Git, GPU programming
Newbie friendliness: 25
