sgl-project/sglang

[Kernel] cuDNN attention backend

Open

#2272 opened on Nov 30, 2024

Labels: enhancement, good first issue, help wanted, high priority, inactive


Description

cuDNN provides a very fast attention implementation and is well maintained by NVIDIA. We would like to add a new attention backend based on cuDNN.

Steps

  1. Learn the cuDNN paged attention Python API: https://github.com/NVIDIA/cudnn-frontend/blob/v1.8.0/samples/python/52_scaled_dot_product_attention_with_paged_caches.ipynb
  2. Add a new attention backend named "cudnn" here: https://github.com/sgl-project/sglang/tree/main/python/sglang/srt/layers/attention
  3. The new backend should then be usable with: python3 -m sglang.launch_server --model meta-llama/Llama-3.1-8B-Instruct --attention-backend cudnn
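
For anyone picking this up, it may help to have a tiny reference for what attention over a paged KV cache computes, so a new backend's output can be compared against it. The sketch below is pure Python for a single head and a single decode query. It does not use the cuDNN or sglang APIs, and every name in it (`gather_paged`, `paged_decode_attention`, `page_table`, `page_size`) is hypothetical and chosen only for illustration.

```python
import math


def gather_paged(cache, page_table, seq_len, page_size):
    """Collect the first seq_len token vectors of one sequence from a paged cache.

    cache:      list of pages, each page a list of page_size vectors
    page_table: page indices owned by this sequence, in logical order
    (Hypothetical layout, not the cuDNN container format.)
    """
    rows = []
    for pos in range(seq_len):
        page = page_table[pos // page_size]      # which physical page holds this token
        rows.append(cache[page][pos % page_size])  # slot inside that page
    return rows


def paged_decode_attention(q, k_cache, v_cache, page_table, seq_len, page_size):
    """Scaled dot-product attention for one query vector over a paged KV cache."""
    keys = gather_paged(k_cache, page_table, seq_len, page_size)
    values = gather_paged(v_cache, page_table, seq_len, page_size)

    # scores = q . k / sqrt(head_dim)
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]

    # Numerically stable softmax over the valid tokens only.
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)

    # Weighted sum of value vectors.
    head_dim = len(values[0])
    return [
        sum(w * v[d] for w, v in zip(weights, values)) / total
        for d in range(head_dim)
    ]
```

The sequence's pages need not be contiguous or in order in the physical cache; only `page_table` and `seq_len` decide which slots are read, so stale data in unused slots is ignored. A real backend would do the same lookup on the GPU without materializing the gathered keys and values.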
