sgl-project/sglang

[Kernel] cuDNN attention backend

Open

#2272, created on November 30, 2024

Labels: enhancement, good first issue, help wanted, high priority, inactive


Description

cuDNN provides a very fast attention implementation that is well maintained by NVIDIA. We would like to add a new attention backend based on cuDNN.

Steps

  1. Learn the cuDNN paged attention Python API: https://github.com/NVIDIA/cudnn-frontend/blob/v1.8.0/samples/python/52_scaled_dot_product_attention_with_paged_caches.ipynb
  2. Add a new attention backend named "cudnn" here: https://github.com/sgl-project/sglang/tree/main/python/sglang/srt/layers/attention
  3. The new backend should then be usable with: python3 -m sglang.launch_server --model meta-llama/Llama-3.1-8B-Instruct --attention-backend cudnn
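
For anyone picking this up, it may help to have a small CPU reference to compare a new backend's decode output against. The sketch below is not the cuDNN frontend API and not sglang code. It is a pure-Python illustration of attention over a paged KV cache, assuming a single head, a single sequence, and a simple page table that maps logical pages to physical pages. The names paged_decode_attention, dense_attention, k_pages, v_pages, and page_table are hypothetical and chosen only for this example.

```python
import math


def dense_attention(q, keys, values):
    """Softmax(q . K^T / sqrt(d)) V for one query over contiguous K/V lists."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    # Subtract the max score before exponentiating for numerical stability.
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dv = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(dv)]


def paged_decode_attention(q, k_pages, v_pages, page_table, seq_len, page_size):
    """Single-head decode attention over a paged KV cache (reference only).

    q          : query vector (list of floats) for the token being decoded
    k_pages    : physical pages; each page holds page_size key vectors
    v_pages    : same layout as k_pages, holding value vectors
    page_table : page_table[i] is the physical page of logical page i
    seq_len    : number of valid cached tokens (the last page may be padded)
    """
    keys, values = [], []
    for t in range(seq_len):
        page = page_table[t // page_size]  # logical page -> physical page
        slot = t % page_size               # offset of token t inside its page
        keys.append(k_pages[page][slot])
        values.append(v_pages[page][slot])
    # After the gather, the math is identical to the contiguous case.
    return dense_attention(q, keys, values)


if __name__ == "__main__":
    k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5], [2.0, 0.0]]
    v = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
    pad = [0.0, 0.0]
    # Tokens 0-1 live in physical page 2, tokens 2-3 in page 0, token 4 in page 1.
    k_pages = [[k[2], k[3]], [k[4], pad], [k[0], k[1]]]
    v_pages = [[v[2], v[3]], [v[4], pad], [v[0], v[1]]]
    q = [0.3, -0.7]
    paged = paged_decode_attention(q, k_pages, v_pages, [2, 0, 1], 5, 2)
    dense = dense_attention(q, k, v)
    print(paged, dense)
```

The paged result should match the contiguous one regardless of how pages are ordered in physical memory, which makes this kind of check a reasonable first test before comparing against an existing backend on real model outputs.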
