sgl-project/sglang

[Kernel] cuDNN attention backend

Open

#2272 opened on Nov 30, 2024

Labels: enhancement, good first issue, help wanted, high priority, inactive

Description

cuDNN provides a very fast attention implementation and is well maintained by NVIDIA. We would like to add a new attention backend based on cuDNN.

Steps

  1. Learn the cuDNN paged attention Python API from this sample notebook: https://github.com/NVIDIA/cudnn-frontend/blob/v1.8.0/samples/python/52_scaled_dot_product_attention_with_paged_caches.ipynb
  2. Add a new attention backend named "cudnn" alongside the existing backends here: https://github.com/sgl-project/sglang/tree/main/python/sglang/srt/layers/attention
  3. Once the backend is in place, it should be usable with: python3 -m sglang.launch_server --model meta-llama/Llama-3.1-8B-Instruct --attention-backend cudnn
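
For anyone picking this up, a small reference implementation may help when checking a cuDNN-backed kernel for correctness. The sketch below is plain Python, single head, single decode query. It is not the cuDNN or SGLang API. The function names, the container[page][slot] layout, and the page_table argument are all assumptions made for illustration. It only shows the idea behind paged attention: look up each token's physical page through a page table, gather K and V, then apply ordinary scaled dot-product attention.

```python
import math


def gather_paged(container, page_table, seq_len, page_size):
    """Collect the first seq_len token vectors of one sequence from a paged cache.

    Assumed (hypothetical) layout: container[p][s] is the vector stored in
    slot s of physical page p, and page_table[i] is the physical page that
    holds logical page i of this sequence.
    """
    rows = []
    for t in range(seq_len):
        page = page_table[t // page_size]  # logical page -> physical page
        rows.append(container[page][t % page_size])
    return rows


def sdpa_decode(q, keys, values):
    """Single-head scaled dot-product attention for one query vector."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    # Subtract the max score before exponentiating for numerical stability.
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]


def paged_decode_attention(q, k_container, v_container, page_table, seq_len, page_size):
    """Reference paged attention: gather K/V through the page table, then attend."""
    keys = gather_paged(k_container, page_table, seq_len, page_size)
    values = gather_paged(v_container, page_table, seq_len, page_size)
    return sdpa_decode(q, keys, values)
```

A backend test could scatter a contiguous K/V cache into pages in shuffled order, run both this reference and the cuDNN path on the same inputs, and compare the outputs within a tolerance suited to the dtype in use.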
