sgl-project/sglang

[Kernel] cuDNN attention backend

Open

#2272 opened on Nov 30, 2024

Labels: enhancement, good first issue, help wanted, high priority, inactive

Description

cuDNN provides a very fast attention implementation that is well maintained by NVIDIA. We would like to add a new attention backend based on cuDNN.

Steps

  1. Study the cuDNN paged attention Python API: https://github.com/NVIDIA/cudnn-frontend/blob/v1.8.0/samples/python/52_scaled_dot_product_attention_with_paged_caches.ipynb
  2. Add a new attention backend named "cudnn" under https://github.com/sgl-project/sglang/tree/main/python/sglang/srt/layers/attention
  3. The new backend should then be selectable with: python3 -m sglang.launch_server --model meta-llama/Llama-3.1-8B-Instruct --attention-backend cudnn
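
For anyone picking this up, it may help to have a tiny reference for what paged attention computes, so the output of a new backend can be checked against something simple. The sketch below is plain Python for a single head and a single decode query. It does not use the cuDNN or SGLang APIs, and every name in it (`gather_paged`, `paged_decode_attention`, the cache layout `cache[page][slot]`) is a hypothetical choice made for illustration only.

```python
import math


def gather_paged(cache, page_table_row, seq_len, page_size):
    """Collect the first seq_len token vectors of one sequence.

    cache[page][slot] is one token vector (assumed layout, illustration only).
    page_table_row[i] is the physical page holding logical page i.
    """
    tokens = []
    for pos in range(seq_len):
        page = page_table_row[pos // page_size]
        tokens.append(cache[page][pos % page_size])
    return tokens


def softmax(xs):
    # Subtract the max before exponentiating for numerical stability.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def paged_decode_attention(q, k_cache, v_cache, page_table_row, seq_len, page_size):
    """Single-head, single-query scaled dot-product attention over a paged KV cache."""
    keys = gather_paged(k_cache, page_table_row, seq_len, page_size)
    values = gather_paged(v_cache, page_table_row, seq_len, page_size)
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    probs = softmax(scores)
    return [sum(p * v[d] for p, v in zip(probs, values)) for d in range(len(q))]


if __name__ == "__main__":
    # Three tokens, page size 2: logical pages 0 and 1 live in physical pages 2 and 0.
    k_cache = [[[0.0, 0.0], [0.0, 0.0]] for _ in range(3)]
    v_cache = [
        [[3.0, 3.0], [99.0, 99.0]],  # physical page 0: token 2, then an unused slot
        [[7.0, 7.0], [7.0, 7.0]],    # physical page 1: not referenced by this sequence
        [[3.0, 0.0], [0.0, 3.0]],    # physical page 2: tokens 0 and 1
    ]
    print(paged_decode_attention([1.0, 2.0], k_cache, v_cache, [2, 0], 3, 2))
```

The point of the indirection through `page_table_row` is that a sequence's KV entries need not be contiguous in memory, and slots past `seq_len` in the last page must be ignored. A real backend would do the gather and softmax on the GPU in batched form; this version is only meant as a correctness oracle for small inputs.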
