Description
Motivation
We plan to add LoRA (Low-Rank Adaptation) support to the SGLang JAX framework. LoRA is one of the most widely used PEFT (Parameter-Efficient Fine-Tuning) methods.
From the inference perspective, serving multiple LoRA adapters on top of one shared base model within a single inference service avoids deploying a separate full model for every fine-tuned variant, which can significantly improve efficiency and reduce costs. From the post-training perspective, LoRA trains only small low-rank matrices instead of all weights, so optimizer state scales with the adapter size rather than with the total parameter count. Compared with full fine-tuning, this greatly reduces resource requirements and makes post-training experiments much more efficient.
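To make the resource argument concrete, here is a small worked calculation. It assumes the standard LoRA parameterization from the original LoRA paper, where a frozen weight W of shape (d_out, d_in) is adapted as W + (α/r)·B·A with B of shape (d_out, r) and A of shape (r, d_in), and it assumes an Adam-style optimizer that keeps two moment estimates per trainable parameter. The function name `lora_param_counts` and the example sizes (a 4096 × 4096 projection at rank 16) are illustrative choices, not values taken from SGLang JAX.

```python
def lora_param_counts(d_out: int, d_in: int, rank: int) -> dict:
    """Trainable-parameter comparison for one weight matrix W of shape (d_out, d_in)."""
    full = d_out * d_in            # full fine-tuning updates every entry of W
    lora = rank * (d_out + d_in)   # LoRA trains only B (d_out x r) and A (r x d_in)
    return {
        "full": full,
        "lora": lora,
        "ratio": lora / full,
        # Assumes an Adam-style optimizer: two moment estimates (m and v) per trainable parameter
        "full_adam_state": 2 * full,
        "lora_adam_state": 2 * lora,
    }


counts = lora_param_counts(d_out=4096, d_in=4096, rank=16)
print(counts["full"], counts["lora"], f"{counts['ratio']:.2%}")
# 16777216 131072 0.78%
```

For this single projection, LoRA trains 131,072 parameters instead of 16,777,216 (about 0.78%), and the optimizer state shrinks by the same factor.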
Road Map
The first version, Initial Support Multi-LoRA (https://github.com/sgl-project/sglang-jax/pull/435), has already been merged.
- Design Docs [Docs] Multi LoRA Design Documentation #321
- Release Initial Code @JamesBrianD https://github.com/sgl-project/sglang-jax/pull/432
- New interface and new configuration
- Request pipeline with `lora_path`
- Core Components https://github.com/sgl-project/sglang-jax/pull/435
  - Implement `LoRAAdapter` @pathfinder-pf
  - Implement `lora_layer` with JAX native kernel @aolemila
  - `LoRAManager` implementation @pathfinder-pf
    - Implement `prepare_lora_batch()`, `forward_batch()` @JamesBrianD
    - Implement
- JIT Precompilation @aolemila
  - LoRA-specific precompilation
- Base Kernel
  - BGMV @aolemila
- Support dynamic LoRA @JamesBrianD
- Support static LoRA @aolemila
- OpenAI-compatible API
- LoRA usage documentation
- Advanced Pallas kernels
  - SGMV
  - BGMV
- LRU eviction policy in the LoRA Manager
- More test cases
  - LoRA layers test
  - LoRA base accuracy test (sgl_jax vs. bonsai + qwix)
  - BGMV test
  - Multi-host LoRA test
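The BGMV items above refer to batched gather matrix-vector multiplication: each token in a batch may use a different adapter, so that adapter's low-rank weights are gathered per token before the LoRA product is added to the shared base projection. Below is a minimal plain `jax.numpy` sketch of that computation, a reference rather than a Pallas kernel. The stacked weight layout (`lora_a` of shape (N, r, d_in), `lora_b` of shape (N, d_out, r)), the per-adapter `scaling` vector, the convention that slot 0 with scaling 0.0 means "no LoRA", and the function names `bgmv_shrink`, `bgmv_expand`, and `lora_linear` are all hypothetical and are not the sglang-jax API.

```python
import jax
import jax.numpy as jnp


def bgmv_shrink(x, lora_a, indices):
    """Project each token down to rank r using its own adapter's A matrix.

    x: (T, d_in), lora_a: (N, r, d_in), indices: (T,) adapter slot per token.
    Returns (T, r).
    """
    a = jnp.take(lora_a, indices, axis=0)  # per-token gather: (T, r, d_in)
    return jnp.einsum("td,trd->tr", x, a)


def bgmv_expand(h, lora_b, indices):
    """Project each rank-r activation back up using its adapter's B matrix.

    h: (T, r), lora_b: (N, d_out, r). Returns (T, d_out).
    """
    b = jnp.take(lora_b, indices, axis=0)  # per-token gather: (T, d_out, r)
    return jnp.einsum("tr,tor->to", h, b)


@jax.jit
def lora_linear(x, w, lora_a, lora_b, scaling, indices):
    """Base projection shared by all tokens plus a per-token LoRA delta."""
    base = x @ w.T  # w: (d_out, d_in), same for every token
    delta = bgmv_expand(bgmv_shrink(x, lora_a, indices), lora_b, indices)
    return base + scaling[indices][:, None] * delta


# Toy batch: 4 tokens, 3 adapter slots; slot 0 has scaling 0.0 (hypothetical "no LoRA" slot).
T, d_in, d_out, r, N = 4, 8, 6, 2, 3
k1, k2, k3, k4 = jax.random.split(jax.random.PRNGKey(0), 4)
x = jax.random.normal(k1, (T, d_in))
w = jax.random.normal(k2, (d_out, d_in))
lora_a = jax.random.normal(k3, (N, r, d_in))
lora_b = jax.random.normal(k4, (N, d_out, r))
scaling = jnp.array([0.0, 1.0, 0.5])
indices = jnp.array([0, 1, 2, 1])

y = lora_linear(x, w, lora_a, lora_b, scaling, indices)
print(y.shape)  # (4, 6)
```

This reference materializes a full copy of each token's adapter weights during the gather, which is memory-hungry for real model sizes; avoiding that materialization is presumably one reason the roadmap lists dedicated Pallas BGMV and SGMV kernels.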
We warmly welcome more suggestions and contributions from the community.
If you have ideas for a more detailed feature breakdown, design improvements, performance optimizations, or implementation strategies, feel free to share them in this issue or open a PR/discussion.
Contributions of all kinds — including design proposals, code, documentation, and benchmarking — are greatly appreciated.