sgl-project/sglang

[Kernel] cuDNN attention backend

Open

#2,272 opened on November 30, 2024

Labels: enhancement, good first issue, help wanted, high priority, inactive

Description

cuDNN provides a very fast attention implementation that is well maintained by NVIDIA. We would like to add a new attention backend based on cuDNN.

Steps

  1. Learn the cuDNN paged attention Python API from this sample notebook: https://github.com/NVIDIA/cudnn-frontend/blob/v1.8.0/samples/python/52_scaled_dot_product_attention_with_paged_caches.ipynb
  2. Add a new attention backend named "cudnn" under https://github.com/sgl-project/sglang/tree/main/python/sglang/srt/layers/attention
  3. Once it is added, we should be able to select it with python3 -m sglang.launch_server --model meta-llama/Llama-3.1-8B-Instruct --attention-backend cudnn
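
For anyone picking this up, a small reference for what attention over a paged KV cache computes may help when checking a new backend's output. The sketch below is plain Python, not the cuDNN frontend API and not sglang code: the function names (paged_decode_attention, scatter_into_pages), the cache layout, and the single-head, single-query setup are all assumptions made for illustration. It only shows the idea that keys and values are looked up through a page table before the usual scaled dot-product attention.

```python
import math


def scatter_into_pages(vectors, page_table, page_size, num_physical_pages):
    """Hypothetical helper: place per-token vectors into a paged cache.

    Token t goes to physical page page_table[t // page_size], slot t % page_size.
    """
    d = len(vectors[0])
    cache = [[[0.0] * d for _ in range(page_size)]
             for _ in range(num_physical_pages)]
    for t, vec in enumerate(vectors):
        cache[page_table[t // page_size]][t % page_size] = list(vec)
    return cache


def paged_decode_attention(q, k_cache, v_cache, page_table, seq_len, page_size):
    """Single-head, single-query attention over a paged KV cache (reference only).

    q:          query vector for the new token, length d
    k_cache:    list of pages, each a list of page_size key vectors
    v_cache:    same layout as k_cache, holding value vectors
    page_table: logical page index -> physical page index for this sequence
    seq_len:    number of valid cached tokens
    """
    scale = 1.0 / math.sqrt(len(q))

    # Gather keys and values token by token through the page table.
    scores, values = [], []
    for t in range(seq_len):
        page = page_table[t // page_size]
        slot = t % page_size
        k = k_cache[page][slot]
        scores.append(scale * sum(qi * ki for qi, ki in zip(q, k)))
        values.append(v_cache[page][slot])

    # Numerically stable softmax over the attention scores.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Weighted sum of the value vectors.
    d_v = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) for j in range(d_v)]


# Toy data: 5 cached tokens, head dimension 2, pages of 2 tokens each.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5], [2.0, 0.0]]
vals = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
q = [1.0, 0.5]
page_size = 2

# Non-contiguous physical placement: logical pages 0, 1, 2 live in pages 3, 0, 2.
scattered_table = [3, 0, 2]
out_paged = paged_decode_attention(
    q,
    scatter_into_pages(keys, scattered_table, page_size, 4),
    scatter_into_pages(vals, scattered_table, page_size, 4),
    scattered_table, len(keys), page_size,
)

# Contiguous placement of the same tokens, used as the comparison baseline.
identity_table = [0, 1, 2]
out_ref = paged_decode_attention(
    q,
    scatter_into_pages(keys, identity_table, page_size, 3),
    scatter_into_pages(vals, identity_table, page_size, 3),
    identity_table, len(keys), page_size,
)
```

The two outputs should agree, since the page table only changes where tokens are stored, not which tokens are attended to. A real backend would do the same gather on the GPU in batched form, but a comparison of this kind against an existing backend is one way to sanity-check step 2.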
