sgl-project/sglang

[Kernel] cuDNN attention backend

Open

#2272 opened on Nov 30, 2024

Labels: enhancement, good first issue, help wanted, high priority, inactive

Description

cuDNN provides a very fast attention implementation and is well maintained by NVIDIA. We would like to add a new attention backend based on cuDNN.

Steps

  1. Learn the cuDNN paged attention Python API: https://github.com/NVIDIA/cudnn-frontend/blob/v1.8.0/samples/python/52_scaled_dot_product_attention_with_paged_caches.ipynb
  2. Add a new attention backend "cudnn" here: https://github.com/sgl-project/sglang/tree/main/python/sglang/srt/layers/attention
  3. We should be able to use it with: python3 -m sglang.launch_server --model meta-llama/Llama-3.1-8B-Instruct --attention-backend cudnn
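
For anyone picking this up, it may help to have a tiny reference for what "paged attention" computes, so that a cuDNN-backed implementation can be checked against it on small inputs. The sketch below is plain Python for a single head and a single decode step. It is not the cuDNN frontend API and not SGLang code: the function names (gather_paged, paged_decode_attention) and the cache layout (a list of fixed-size pages plus a per-sequence page table) are assumptions made only for illustration.

```python
import math


def gather_paged(cache, page_table, seq_len):
    """Rebuild the logical token order from a paged cache.

    cache:      list of pages; each page is a list of `page_size` vectors
    page_table: page indices for one sequence, in logical order
    seq_len:    number of valid tokens (the last page may be partly unused)
    """
    page_size = len(cache[0])
    rows = []
    for pos in range(seq_len):
        page = page_table[pos // page_size]      # which physical page
        rows.append(cache[page][pos % page_size])  # slot inside that page
    return rows


def paged_decode_attention(q, k_cache, v_cache, page_table, seq_len):
    """Scaled dot-product attention for one query vector over a paged KV cache."""
    keys = gather_paged(k_cache, page_table, seq_len)
    values = gather_paged(v_cache, page_table, seq_len)

    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]

    # Numerically stable softmax over the valid tokens only.
    max_score = max(scores)
    weights = [math.exp(s - max_score) for s in scores]
    total = sum(weights)

    out = [0.0] * len(values[0])
    for w, v in zip(weights, values):
        for i, vi in enumerate(v):
            out[i] += (w / total) * vi
    return out
```

A real backend would do the gather inside the kernel rather than materializing contiguous keys and values, but the result for a given page table and sequence length should match this reference up to floating-point tolerance.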
