vllm-project/vllm
[Feature]: Add support for fused fp8 output to FlashAttention 3
Open
#29920 opened on Dec 2, 2025
feature request, help wanted, keep-open, performance, torch.compile
Description
🚀 The feature, motivation and pitch
On Hopper, we use FlashAttention as the default attention backend. When o-proj is quantized to fp8, we are leaving performance on the table because FA3 does not support fused fp8 output quantization. With the Triton/ROCm/AITER backends, we saw speedups of up to 8% from attention+quant fusion.
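To make the difference concrete, here is a minimal pure-Python sketch of the unfused and fused paths. It assumes a static per-tensor scale and the convention that the quantized value is `x / scale`, saturated to the float8 e4m3 range; the function names are hypothetical and the final rounding to the fp8 grid is omitted. It only illustrates that fusion changes where the work happens, not the result.

```python
FP8_E4M3_MAX = 448.0  # largest finite magnitude in float8 e4m3fn


def quantize_static(values, scale):
    """Scale and saturate values for fp8 storage.

    Assumption: quantized = x / scale, clamped to the e4m3 range.
    Rounding to the fp8 grid is intentionally left out of this sketch.
    """
    return [max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, v / scale)) for v in values]


def unfused(attn_out, output_scale):
    # Step 1: the attention kernel writes a high-precision output
    # (modelled here as a plain copy of the values).
    high_precision = list(attn_out)
    # Step 2: a separate quant kernel re-reads that output and scales it.
    return quantize_static(high_precision, output_scale)


def fused(attn_out, output_scale):
    # Scaling and saturation happen in the attention epilogue, so the
    # high-precision intermediate is never written out and re-read.
    return quantize_static(attn_out, output_scale)
```

Both paths should produce identical values; the expected saving comes from skipping the intermediate write and the extra kernel launch.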
vLLM already maintains its own fork of FA, so adding output quant support should be fairly non-intrusive. Subtasks:
- vllm-flash-attn:
  - add `output_scale` parameter to attention forward functions
  - plumb parameter through all layers of the interface
  - compare branching at runtime/compile-time for performance and binary size (Hopper)
- vllm:
  - integrate new FA version
  - add support for attention+quant fusion to FA attention backend
    - check FA version, hardware version
    - should be as easy as modifying the `supports_fused_output_quant` method and plumbing `output_scale` from `FlashAttentionImpl.forward()` to the kernel call
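The version and hardware check in the vllm subtask could look roughly like the sketch below. This is a standalone illustration, not vLLM's actual interface: `KernelInfo`, its fields, the `"fp8_e4m3"` dtype string, and the free-function signature of `supports_fused_output_quant` are all assumptions made for the example. The only hard facts it relies on are that the feature targets FA3 and that Hopper is compute capability 9.x.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class KernelInfo:
    """Hypothetical description of the attention kernel in use."""
    fa_version: int                      # FlashAttention major version (2 or 3)
    compute_capability: Tuple[int, int]  # e.g. (9, 0) for Hopper
    has_output_scale: bool               # assumed: fork build exposes output_scale


def supports_fused_output_quant(info: KernelInfo, quant_dtype: str) -> bool:
    """Return True only when fused fp8 output quant can be used.

    Assumed gating rules for this sketch:
      * only fp8 (e4m3) output is handled,
      * only FlashAttention 3 carries the new epilogue,
      * FA3 runs on Hopper (compute capability 9.x),
      * the installed fork build must accept output_scale.
    """
    if quant_dtype != "fp8_e4m3":
        return False
    if info.fa_version != 3:
        return False
    if info.compute_capability[0] != 9:
        return False
    return info.has_output_scale


hopper_fa3 = KernelInfo(fa_version=3, compute_capability=(9, 0), has_output_scale=True)
ampere_fa2 = KernelInfo(fa_version=2, compute_capability=(8, 0), has_output_scale=False)
```

When the check fails, the backend would fall back to the existing path: attention writes a high-precision output and quantization runs as a separate op.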
Additional context
cc @LucasWilkinson