Description
cuDNN provides a very fast attention implementation and is well maintained by NVIDIA. We would like to add a new attention backend based on cuDNN.
Steps
- Learn the cuDNN paged attention Python API: https://github.com/NVIDIA/cudnn-frontend/blob/v1.8.0/samples/python/52_scaled_dot_product_attention_with_paged_caches.ipynb
- Add a new attention backend named "cudnn" here: https://github.com/sgl-project/sglang/tree/main/python/sglang/srt/layers/attention
- Once it is added, we should be able to use it with:
python3 -m sglang.launch_server --model meta-llama/Llama-3.1-8B-Instruct --attention-backend cudnn
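
For anyone picking this up, it may help to have a tiny reference for what "paged" attention computes, so a cuDNN-backed implementation can be checked against it. The sketch below is plain Python and does not use the cuDNN or SGLang APIs. The function names, the container layout (a list of fixed-size pages), and the page-table format are all assumptions made for illustration. It handles only a single head and a single query token.

```python
import math


def gather_paged(container, page_table, seq_len, page_size):
    """Collect the first seq_len token vectors of one sequence from a paged cache.

    container:  list of physical pages, each a list of page_size vectors
                (hypothetical layout, chosen for readability).
    page_table: physical page index for each logical block of the sequence.
    """
    rows = []
    for pos in range(seq_len):
        page = page_table[pos // page_size]   # logical block -> physical page
        rows.append(container[page][pos % page_size])
    return rows


def paged_attention(query, k_container, v_container, page_table, seq_len, page_size):
    """Single-head, single-query scaled dot-product attention over a paged KV cache."""
    keys = gather_paged(k_container, page_table, seq_len, page_size)
    values = gather_paged(v_container, page_table, seq_len, page_size)

    # Scaled dot-product scores: (q . k) / sqrt(head_dim)
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]

    # Numerically stable softmax over the sequence positions
    peak = max(scores)
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)

    # Weighted sum of the value vectors
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]


# Example: 3 tokens stored out of order across physical pages 2 and 0 (page size 2).
k_container = [[[1, 0], [5, 5]], [[7, 7], [7, 7]], [[1, 0], [1, 0]]]
v_container = [[[2, 2], [9, 9]], [[100, 100], [100, 100]], [[1, 0], [0, 1]]]
out = paged_attention([0, 1], k_container, v_container,
                      page_table=[2, 0], seq_len=3, page_size=2)
```

A real backend would run this on the GPU in batches and across many heads, but the gather-through-page-table step followed by ordinary softmax attention is the behavior to match. Unused slots and unrelated pages (page 1 above) must not influence the result.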