[Performance]: Update Cascade Attention Heuristics for FA3 · vllm-project/vllm#15647

20 comments · 0 reactions · 1 assignee · Python · 16,816 forks

Labels: feature request, good first issue, help wanted, stale, unstale


Currently, we use a heuristic (https://github.com/vllm-project/vllm/blob/4098b72210dc10761bb348b373bbd0fc9b23b0e4/vllm/v1/attention/backends/flash_attn.py#L331) to decide whether cascade attention would improve performance. However, this heuristic was developed before the FA3 integration and is therefore tuned only for FA2. Its SM occupancy calculation is no longer accurate for FA3 and needs updating.

Specifically:

1. We need to split the heuristic into separate FA2 and FA3 cases, since FA2 is still used on certain GPUs.
2. As far as I know, FA3 uses different heuristics than FA2 for GQA packing and split-KV. The heuristic in use_cascade should reflect this difference (although it doesn't need to be highly accurate).
3. It would be nice to cite the specific lines of code in FA3 (and FA2) that decide the tile sizes, scheduling, etc., so that we can easily verify and track them.
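
To make the requested split concrete, here is a minimal sketch of a version-aware cost comparison. It is not the vLLM implementation: `TileConfig`, `TILE_CONFIGS`, and `estimate_cascade_wins` are hypothetical names, the cost model is a simplified wave count, and the tile sizes are placeholders that would need to be replaced with values cited from the FA2 and FA3 sources.

```python
from dataclasses import dataclass


def cdiv(a: int, b: int) -> int:
    """Ceiling division for positive integers."""
    return -(-a // b)


@dataclass(frozen=True)
class TileConfig:
    """Hypothetical per-backend tiling parameters."""
    q_tile: int   # query rows handled by one CTA
    kv_tile: int  # KV rows handled by one CTA iteration / split


# PLACEHOLDER values, not taken from the FA2 or FA3 source code.
# Each entry should eventually cite the kernel code that sets it.
TILE_CONFIGS = {
    2: TileConfig(q_tile=128, kv_tile=128),
    3: TileConfig(q_tile=128, kv_tile=128),
}


def estimate_cascade_wins(
    fa_version: int,
    common_prefix_len: int,
    num_reqs: int,
    num_query_heads: int,
    num_kv_heads: int,
    num_sms: int,
    tile_configs: dict = TILE_CONFIGS,
) -> bool:
    """Rough estimate for a decode-only batch (one query token per request).

    Returns True if processing the shared prefix once for the whole batch
    (cascade) is estimated to be cheaper than split-KV decoding per request.
    """
    cfg = tile_configs[fa_version]  # raises KeyError for unknown versions
    queries_per_kv = num_query_heads // num_kv_heads
    prefix_tiles = cdiv(common_prefix_len, cfg.kv_tile)

    # Cascade: all requests are batched along the query dimension, and each
    # CTA walks the prefix tiles sequentially.
    cascade_ctas = num_query_heads * cdiv(num_reqs, cfg.q_tile)
    cascade_cost = cdiv(cascade_ctas, num_sms) * prefix_tiles

    # Split-KV decode: query heads sharing a KV head are packed into one tile
    # (GQA packing), and the prefix is split across CTAs.
    decode_ctas = (
        num_reqs * num_kv_heads * cdiv(queries_per_kv, cfg.q_tile) * prefix_tiles
    )
    decode_cost = cdiv(decode_ctas, num_sms)

    return cascade_cost < decode_cost
```

Keeping the per-version parameters in one table means an FA3 entry can diverge from FA2 without touching the comparison logic, and each value has a single place to attach a source citation.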


Research direction: Examine the current heuristic in vllm/v1/attention/backends/flash_attn.py and research FA3's tiling and scheduling heuristics to update the cascade attention decision logic.
Tech stack: python
Domain: backend
Issue type: Performance
Difficulty: 3
Estimated time: 1-2 days
Activity status: Active
Clarity: Clear
Prerequisites: Python, PyTorch
Newbie friendliness: 30