[Kernel] cuDNN attention backend · sgl-project/sglang#2272

5 comments · 1 reaction · 2 assignees · Python · 6216 forks

Labels: enhancement, good first issue, help wanted, high priority, inactive


cuDNN provides a very fast attention implementation and is well maintained by NVIDIA. We would like to add a new attention backend based on cuDNN.

1. Learn the cuDNN paged attention Python API: https://github.com/NVIDIA/cudnn-frontend/blob/v1.8.0/samples/python/52_scaled_dot_product_attention_with_paged_caches.ipynb
2. Add a new attention backend "cudnn" here: https://github.com/sgl-project/sglang/tree/main/python/sglang/srt/layers/attention
3. We should then be able to use it with: python3 -m sglang.launch_server --model meta-llama/Llama-3.1-8B-Instruct --attention-backend cudnn
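
To make the steps above concrete, here is a minimal sketch of how a backend name passed via --attention-backend could be mapped to a backend class. Everything except the flag name is an assumption for illustration: `AttentionBackend`, `forward_decode`, `register_backend`, `BACKENDS`, and `build_backend` are hypothetical and are not sglang's actual interface, which should be read from the existing backends in the directory linked above.

```python
import argparse
from abc import ABC, abstractmethod

# Hypothetical registry: backend name -> backend class (not sglang's real mechanism).
BACKENDS = {}


def register_backend(name):
    """Class decorator that records a backend under the given CLI name."""
    def decorator(cls):
        BACKENDS[name] = cls
        return cls
    return decorator


class AttentionBackend(ABC):
    """Hypothetical minimal interface; the real one has more methods."""

    @abstractmethod
    def forward_decode(self, q, k_cache, v_cache):
        """Compute attention for one decode step."""


@register_backend("reference")
class ReferenceBackend(AttentionBackend):
    def forward_decode(self, q, k_cache, v_cache):
        # Placeholder: a real backend would return the attention output.
        return "reference"


@register_backend("cudnn")
class CudnnBackend(AttentionBackend):
    def forward_decode(self, q, k_cache, v_cache):
        # Placeholder: this is where the cuDNN paged attention call would go.
        raise NotImplementedError("cuDNN paged attention is not wired up in this sketch")


def build_backend(argv):
    """Parse --attention-backend and instantiate the matching backend."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--attention-backend",
        choices=sorted(BACKENDS),
        default="reference",
    )
    args = parser.parse_args(argv)
    return BACKENDS[args.attention_backend]()


backend = build_backend(["--attention-backend", "cudnn"])
print(type(backend).__name__)
```

A registry keyed by name keeps the launch command unchanged when a new backend is added; only the new class and its registration are needed.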

Research direction: Study the existing attention backends in sglang/srt/layers/attention, understand the interface, and explore the cuDNN paged attention Python API linked in step 1. Then implement a new backend following the pattern of the existing backends.
Tech stack: Python
Domain: backend, machine learning
Issue type: Feature
Difficulty: 3
Estimated time: 1-2 days
Task status: Active
Clarity: Clear
Prerequisites: Python, CUDA, cuDNN
Beginner-friendly: 60
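
When exploring the paged attention API, a small reference implementation can help check a new backend's output. The sketch below is plain Python and does not use cuDNN. It stores keys and values in fixed-size pages, looks them up through a page table, and confirms the result matches attention over a contiguous cache. All names (`gather_paged`, `paged_attention`, `page_table`, `PAGE_SIZE`) and the toy data are assumptions made for illustration.

```python
import math


def attention(q, keys, values):
    """Scaled dot-product attention for one query over lists of K/V vectors."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    peak = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total for i in range(dim)]


def gather_paged(pages, page_table, seq_len, page_size):
    """Collect the first seq_len vectors of a sequence from non-contiguous pages."""
    out = []
    for pos in range(seq_len):
        physical_page = pages[page_table[pos // page_size]]
        out.append(physical_page[pos % page_size])
    return out


def paged_attention(q, k_pages, v_pages, page_table, seq_len, page_size):
    """Attention over a paged KV cache: gather through the page table, then attend."""
    keys = gather_paged(k_pages, page_table, seq_len, page_size)
    values = gather_paged(v_pages, page_table, seq_len, page_size)
    return attention(q, keys, values)


# Toy sequence of three tokens with head dimension 2.
PAGE_SIZE = 2
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

# Physical pages are out of order: logical page 0 lives in physical page 2,
# logical page 1 in physical page 0. Physical page 1 belongs to another sequence.
k_pages = [[keys[2], [0.0, 0.0]], [[9.0, 9.0], [9.0, 9.0]], [keys[0], keys[1]]]
v_pages = [[values[2], [0.0, 0.0]], [[9.0, 9.0], [9.0, 9.0]], [values[0], values[1]]]
page_table = [2, 0]
query = [0.5, -0.5]

paged_out = paged_attention(query, k_pages, v_pages, page_table, 3, PAGE_SIZE)
dense_out = attention(query, keys, values)
print(paged_out == dense_out)
```

The same comparison, paged output against a contiguous reference, is a reasonable first test for a GPU backend, with a tolerance in place of exact equality.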