vllm-project/vllm
[Performance]: Update Cascade Attention Heuristics for FA3
Open
#15647 opened on Mar 27, 2025
Labels: feature request, good first issue, help wanted, stale, unstale
Description
🚀 The feature, motivation and pitch
Currently, we use a heuristic (https://github.com/vllm-project/vllm/blob/4098b72210dc10761bb348b373bbd0fc9b23b0e4/vllm/v1/attention/backends/flash_attn.py#L331) to decide whether cascade attention would improve performance. However, this heuristic was developed before the FA3 integration and is therefore tuned only for FA2. Its SM occupancy calculation is no longer accurate for FA3 and needs updating.
Specifically,
- We need to handle FA2 and FA3 as separate cases, since FA2 is still used on certain GPUs.
- As far as I know, FA3 uses different heuristics for GQA packing and split-KV than FA2. The heuristics in use_cascade should reflect this difference (although they don't need to be super accurate).
- It'd be nice if we could cite the specific lines of code in FA3 (and FA2) that decide the tile sizes, scheduling, etc., so that we can easily verify and track them.
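One possible shape for the split described above is sketched below. This is a simplified wave-counting model and not vLLM's actual implementation: `TileConfig`, `TILE_CONFIGS`, `estimate_costs`, and `prefer_cascade` are hypothetical names, and the tile sizes are placeholders that would have to be replaced with values taken from (and cited against) the FA2 and FA3 sources.

```python
from dataclasses import dataclass


def cdiv(a: int, b: int) -> int:
    """Ceiling division for positive integers."""
    return -(-a // b)


@dataclass(frozen=True)
class TileConfig:
    q_tile: int   # query rows handled by one CTA
    kv_tile: int  # KV tokens handled per tile
    source: str   # where the numbers come from, so they can be verified


# PLACEHOLDER values, not read from the FA2/FA3 kernels.
# Each entry should eventually cite the exact upstream lines.
TILE_CONFIGS = {
    2: TileConfig(q_tile=128, kv_tile=128, source="TODO: cite FA2 code"),
    3: TileConfig(q_tile=128, kv_tile=128, source="TODO: cite FA3 code"),
}


def estimate_costs(fa_version: int, num_reqs: int, num_query_heads: int,
                   num_kv_heads: int, common_prefix_len: int,
                   num_sms: int) -> tuple[int, int]:
    """Return rough (cascade_cost, decode_cost) in units of KV tiles per SM."""
    cfg = TILE_CONFIGS[fa_version]
    prefix_tiles = cdiv(common_prefix_len, cfg.kv_tile)

    # Cascade: all requests share one pass over the common prefix,
    # so CTAs are tiled over the batch of query tokens.
    cascade_ctas = num_query_heads * cdiv(num_reqs, cfg.q_tile)
    cascade_cost = cdiv(cascade_ctas, num_sms) * prefix_tiles

    # Split-KV decode with GQA packing: each (request, KV head) pair
    # covers the prefix, split into one CTA per KV tile.
    q_per_kv = num_query_heads // num_kv_heads
    decode_ctas = (num_reqs * num_kv_heads
                   * cdiv(q_per_kv, cfg.q_tile) * prefix_tiles)
    decode_cost = cdiv(decode_ctas, num_sms)
    return cascade_cost, decode_cost


def prefer_cascade(fa_version: int, **kwargs: int) -> bool:
    cascade_cost, decode_cost = estimate_costs(fa_version, **kwargs)
    return cascade_cost < decode_cost
```

Keeping the per-version numbers in one table with a `source` field would make it easy to check each value against the upstream kernel code, and FA3-specific GQA packing or split-KV rules could later replace the shared formula without touching the FA2 path.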