good first issue
Description
We now have many kernel backends for MoE (e.g., Triton, CUTLASS), attention, and allreduce. For each kernel, we can also tune some configs (e.g., the tile sizes in the Triton fused MoE kernel). Tuning these kernels and choosing dispatching heuristics can be non-trivial. We would like to build a simple command that auto-tunes all the kernels and dispatch heuristics for a model.
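To make the tuning idea concrete, here is a minimal sketch of an exhaustive search that picks the fastest config per batch size. Everything in it is an assumption for illustration: `tune`, `candidate_configs`, and `fake_measure` are hypothetical names, the `BLOCK_SIZE_M`/`BLOCK_SIZE_N` keys and their values are placeholders, and the toy cost model stands in for a real GPU benchmark.

```python
import itertools
from typing import Callable, Dict, List

Config = Dict[str, int]


def candidate_configs(search_space: Dict[str, List[int]]) -> List[Config]:
    """Expand a {param: [values]} search space into a list of configs."""
    keys = sorted(search_space)
    value_lists = [search_space[k] for k in keys]
    return [dict(zip(keys, values)) for values in itertools.product(*value_lists)]


def tune(
    measure: Callable[[int, Config], float],
    search_space: Dict[str, List[int]],
    batch_sizes: List[int],
) -> Dict[int, Config]:
    """Return the lowest-latency config for each batch size.

    `measure(batch_size, config)` is assumed to return a latency;
    in a real tuner it would launch and time the kernel.
    """
    configs = candidate_configs(search_space)
    return {bs: min(configs, key=lambda cfg: measure(bs, cfg)) for bs in batch_sizes}


def fake_measure(batch_size: int, cfg: Config) -> float:
    # Toy cost model (NOT a real benchmark): tiles close to the batch
    # size are treated as faster, with a small penalty for larger N tiles.
    return abs(cfg["BLOCK_SIZE_M"] - batch_size) + 0.01 * cfg["BLOCK_SIZE_N"]


# Hypothetical search space and batch sizes.
space = {"BLOCK_SIZE_M": [16, 64, 128], "BLOCK_SIZE_N": [32, 64]}
best = tune(fake_measure, space, batch_sizes=[8, 64, 256])
for bs, cfg in best.items():
    print(bs, cfg)
```

The resulting per-batch-size table is one possible form of a dispatch heuristic: at runtime, the kernel launcher would look up the config for the nearest tuned batch size.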
Todo
- Implement a script sglang.auto_tune to tune the kernels and hyperparameters for a specific model. It should dump the optimal tile sizes/dispatching heuristics for all kernels used in this model. Example usage:
python3 -m sglang.auto_tune --model-path Qwen/Qwen3-30B-A3B-Instruct-2507 --tp 8
python3 -m sglang.auto_tune --model-path Qwen/Qwen3-30B-A3B-Instruct-2507 --tp 4
  - Start by tuning the Triton fused MoE kernel: https://github.com/sgl-project/sglang/tree/main/benchmark/kernels/fused_moe_triton
- Implement a GitHub Actions workflow that tunes 20 popular models on 5 common platforms.
  - The workflow should take two arguments: a list of model names and a list of runner names.
  - Our CI has H100, H20, H200, B200, and GB200.
- Auto-choose allreduce algorithms (custom allreduce, NCCL, NCCL symmetric memory, torch symmetric memory).
- Auto-choose the attention kernel backend and the MoE runner backend.
- Auto-tune the CUTLASS GEMM kernels (with the CUTLASS profiler).
- Auto-choose speculative decoding parameters for different batch sizes.
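
As a starting point for discussion, the entry point and the workflow's model-by-runner fan-out could look roughly like the sketch below. This is not an existing sglang API: `build_parser` and `job_matrix` are hypothetical helpers, `--output` is an assumed extra flag, and only `--model-path` and `--tp` come from the example usage above.

```python
import argparse
import itertools
import json
from typing import Dict, List


def build_parser() -> argparse.ArgumentParser:
    """Argument parser for the proposed `sglang.auto_tune` command."""
    parser = argparse.ArgumentParser(prog="sglang.auto_tune")
    parser.add_argument("--model-path", required=True, help="Model to tune for.")
    parser.add_argument("--tp", type=int, default=1, help="Tensor parallel size.")
    # Hypothetical flag: where to dump the tuned configs and heuristics.
    parser.add_argument("--output", default="auto_tune_result.json")
    return parser


def job_matrix(models: List[str], runners: List[str]) -> List[Dict[str, str]]:
    """Cross product of models and runners, one tuning job per pair."""
    return [
        {"model": model, "runner": runner}
        for model, runner in itertools.product(models, runners)
    ]


# Parse the example command line from the issue.
args = build_parser().parse_args(
    ["--model-path", "Qwen/Qwen3-30B-A3B-Instruct-2507", "--tp", "8"]
)

# Runner names follow the CI platforms listed above.
jobs = job_matrix([args.model_path], ["H100", "H20", "H200", "B200", "GB200"])
print(json.dumps(jobs, indent=2))
```

A workflow could serialize the output of `job_matrix` to JSON and feed it into a job matrix, so that each model and runner pair is tuned in its own job.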