[Kernel] cuDNN attention backend · sgl-project/sglang#2272

5 comments · 1 reaction · 2 assignees · Python · 6,216 forks

Labels: enhancement, good first issue, help wanted, high priority, inactive

cuDNN provides a very fast attention implementation and is well maintained by NVIDIA. We would like to add a new attention backend based on cuDNN.

1. Learn the cuDNN paged attention Python API: https://github.com/NVIDIA/cudnn-frontend/blob/v1.8.0/samples/python/52_scaled_dot_product_attention_with_paged_caches.ipynb
2. Add a new attention backend "cudnn" here: https://github.com/sgl-project/sglang/tree/main/python/sglang/srt/layers/attention
3. We should then be able to use it with `python3 -m sglang.launch_server --model meta-llama/Llama-3.1-8B-Instruct --attention-backend cudnn`.
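
When working through the steps above, it may help to have a slow but readable reference for what paged attention computes, so the output of a cuDNN-based backend can be compared against it. The following is a minimal NumPy sketch for the decode case (one query token per sequence). The container layout `[num_pages, H, page_size, D]`, the 2-D `page_table`, and the function names `gather_paged` and `paged_decode_attention` are assumptions made for illustration; they are not taken from sglang or from the cuDNN frontend, whose actual tensor layouts should be checked in the linked notebook.

```python
import numpy as np


def gather_paged(container, page_table_row, seq_len):
    """Rebuild one sequence's contiguous [H, seq_len, D] tensor from its pages.

    container:      [num_pages, H, page_size, D] (assumed layout)
    page_table_row: [max_pages_per_seq] page indices for this sequence
    """
    _, num_heads, page_size, head_dim = container.shape
    num_used = -(-seq_len // page_size)  # ceil division
    pages = container[page_table_row[:num_used]]  # [num_used, H, page_size, D]
    # Lay the pages side by side along the token axis, then drop the padding
    # in the last (possibly partially filled) page.
    merged = pages.transpose(1, 0, 2, 3).reshape(num_heads, -1, head_dim)
    return merged[:, :seq_len, :]


def paged_decode_attention(q, k_container, v_container, page_table, seq_lens):
    """Scaled dot-product attention for one new token per sequence.

    q: [B, H, D]; returns [B, H, D].
    """
    batch, _, head_dim = q.shape
    out = np.empty_like(q)
    scale = 1.0 / np.sqrt(head_dim)
    for b in range(batch):
        k = gather_paged(k_container, page_table[b], seq_lens[b])  # [H, S, D]
        v = gather_paged(v_container, page_table[b], seq_lens[b])  # [H, S, D]
        scores = np.einsum("hd,hsd->hs", q[b], k) * scale
        scores -= scores.max(axis=-1, keepdims=True)  # numerically stable softmax
        probs = np.exp(scores)
        probs /= probs.sum(axis=-1, keepdims=True)
        out[b] = np.einsum("hs,hsd->hd", probs, v)
    return out
```

A backend built on cuDNN would avoid the explicit gather and read the pages directly on the GPU; the sketch only pins down the expected numerical result for a small test case.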

Research direction: Study the existing attention backends in sglang/srt/layers/attention, understand the interface, and explore the cuDNN paged attention Python API linked in step 1. Then implement a new backend following the pattern of the existing backends.
Tech stack: Python
Domain: backend, machine learning
Issue type: Feature
Difficulty: 3
Estimated time: 1-2 days
Activity status: Active
Clarity: Clear
Prerequisites: Python, CUDA, cuDNN
Beginner friendliness: 60
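
The research direction above (study the shared interface, then add a backend that follows the same pattern and is selectable by name) could be sketched roughly as follows. Everything in this block is hypothetical: `AttentionBackendSketch`, its method names, the `BACKENDS` registry, and the dictionary-based `batch` are stand-ins invented for illustration and do not reproduce sglang's actual base class or its `--attention-backend` handling, which should be read from the repository directly.

```python
from abc import ABC, abstractmethod

# Hypothetical name -> class registry, standing in for whatever mechanism
# maps the --attention-backend value to an implementation.
BACKENDS = {}


def register_backend(name):
    """Class decorator that records a backend under a command-line name."""
    def decorator(cls):
        BACKENDS[name] = cls
        return cls
    return decorator


class AttentionBackendSketch(ABC):
    """Hypothetical interface: per-batch setup plus prefill and decode paths."""

    @abstractmethod
    def init_forward_metadata(self, batch):
        """Prepare per-batch state (page table, sequence lengths, ...)."""

    @abstractmethod
    def forward_extend(self, q, k, v, batch):
        """Attention for prefill / extend batches."""

    @abstractmethod
    def forward_decode(self, q, k, v, batch):
        """Attention for single-token decode batches."""

    def forward(self, q, k, v, batch):
        # Route to the decode or extend path based on the batch mode.
        if batch["mode"] == "decode":
            return self.forward_decode(q, k, v, batch)
        return self.forward_extend(q, k, v, batch)


@register_backend("cudnn")
class CudnnBackendSketch(AttentionBackendSketch):
    def init_forward_metadata(self, batch):
        # A real backend would convert these into the tensors cuDNN expects.
        self.page_table = batch["page_table"]
        self.seq_lens = batch["seq_lens"]

    def forward_extend(self, q, k, v, batch):
        raise NotImplementedError("extend: build and run the cuDNN graph here")

    def forward_decode(self, q, k, v, batch):
        raise NotImplementedError("decode: build and run the cuDNN graph here")


def create_backend(name):
    """Instantiate the backend registered under `name`."""
    if name not in BACKENDS:
        raise ValueError(f"unknown attention backend: {name!r}")
    return BACKENDS[name]()
```

Keeping the decode and extend paths as separate methods mirrors the fact that the two cases feed differently shaped queries to the kernel; the two `NotImplementedError` bodies mark where the cuDNN calls from the linked notebook would go.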