[Kernel] cuDNN attention backend · sgl-project/sglang#2272

5 comments · 1 reaction · 2 assignees · Python · 6,216 forks

Labels: enhancement, good first issue, help wanted, high priority, inactive

cuDNN provides a very fast attention implementation and is well maintained by NVIDIA. We would like to add a new attention backend based on cuDNN.

1. Learn the cuDNN paged attention Python API: https://github.com/NVIDIA/cudnn-frontend/blob/v1.8.0/samples/python/52_scaled_dot_product_attention_with_paged_caches.ipynb
2. Add a new attention backend "cudnn" here: https://github.com/sgl-project/sglang/tree/main/python/sglang/srt/layers/attention
3. We should be able to use it with `python3 -m sglang.launch_server --model meta-llama/Llama-3.1-8B-Instruct --attention-backend cudnn`
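To make the goal of the last step concrete, here is a minimal sketch of how a backend name passed on the command line could be mapped to a backend class. Everything in it is hypothetical and for illustration only: `AttentionBackend`, `register_backend`, `CudnnBackend`, and `create_backend` are invented names, not sglang's actual interface, and the real selection logic in `sglang/srt/layers/attention` may look quite different.

```python
import argparse
from typing import Dict

# Hypothetical registry mapping a CLI name to a backend class.
# This is NOT sglang's actual mechanism; it only illustrates the idea.
ATTENTION_BACKENDS: Dict[str, type] = {}


class AttentionBackend:
    """Hypothetical minimal interface, not sglang's real base class."""

    name = "base"

    def forward(self, q, k, v):
        raise NotImplementedError


def register_backend(name):
    """Class decorator that records a backend under the given CLI name."""

    def decorator(cls):
        cls.name = name
        ATTENTION_BACKENDS[name] = cls
        return cls

    return decorator


@register_backend("cudnn")
class CudnnBackend(AttentionBackend):
    def forward(self, q, k, v):
        # A real backend would build and run a cuDNN attention graph here.
        raise NotImplementedError("cuDNN call goes here")


def parse_args(argv):
    """Parse only the two flags used in the issue's example command."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--model", required=True)
    parser.add_argument(
        "--attention-backend",
        choices=sorted(ATTENTION_BACKENDS),
        default="cudnn",
    )
    return parser.parse_args(argv)


def create_backend(name):
    """Instantiate the backend registered under `name`."""
    return ATTENTION_BACKENDS[name]()
```

With this sketch, `create_backend(parse_args(["--model", "m", "--attention-backend", "cudnn"]).attention_backend)` would return a `CudnnBackend` instance; the actual wiring should follow whatever pattern the existing backends use.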

Research direction: Study the existing attention backends in sglang/srt/layers/attention, understand the interface, and explore the cuDNN paged attention Python API linked in step 1. Then implement a new backend following the pattern of the existing backends.
Tech stack: Python
Domain: backend, machine learning
Issue type: Feature
Difficulty: 3
Estimated time: 1-2 days
Activity status: Active
Clarity: Clear
Prerequisites: Python, CUDA, cuDNN
Beginner accessibility: 60
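For anyone new to paged KV caches, the following pure-Python sketch shows what "paged attention" computes for a single query and a single head: the keys and values of a sequence live in fixed-size pages scattered through a cache, and a page table maps logical positions to physical pages. The data layout here (nested lists, one page table per sequence) is an assumption chosen for readability; it is not the cuDNN API or its tensor layout. A slow reference like this can serve as a correctness check when testing a new backend.

```python
import math


def gather_paged(cache, page_table, seq_len, page_size):
    """Collect the first `seq_len` token vectors of one sequence.

    cache[p][i]   -> vector stored in slot i of physical page p
    page_table[j] -> physical page holding logical page j of the sequence
    """
    rows = []
    for t in range(seq_len):
        page = page_table[t // page_size]
        rows.append(cache[page][t % page_size])
    return rows


def paged_attention(q, k_cache, v_cache, page_table, seq_len, page_size):
    """Scaled dot-product attention for one query over a paged KV cache."""
    keys = gather_paged(k_cache, page_table, seq_len, page_size)
    values = gather_paged(v_cache, page_table, seq_len, page_size)

    # Scores are q . k / sqrt(d), as in standard scaled dot-product attention.
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]

    # Numerically stable softmax: subtract the max before exponentiating.
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)

    # Output is the softmax-weighted average of the value vectors.
    dim_v = len(values[0])
    return [
        sum(w * v[d] for w, v in zip(weights, values)) / total
        for d in range(dim_v)
    ]
```

The result should match ordinary attention over the same keys and values stored contiguously; only the lookup through `page_table` differs.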