vllm-project/vllm

[Feature]: Add support for fused fp8 output to FlashAttention 3

Open

#29920 opened on Dec 2, 2025

Labels: feature request, help wanted, keep-open, performance, torch.compile

Description

🚀 The feature, motivation and pitch

On Hopper, FlashAttention is the default attention backend. When o-proj is quantized to fp8, we are leaving performance on the table because FA3 does not support fused fp8 output quantization, so the attention output has to be quantized in a separate step. With the Triton/ROCm/AITER backends, attention+quant fusion gave speedups of up to 8%.
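To make the motivation concrete, here is a minimal pure-Python sketch of what an fp8 output-quant step computes and why fusing it saves memory traffic. It assumes static per-tensor scaling, a 16-bit (2-byte) unquantized attention output, and the e4m3 format with a maximum finite value of 448; `quantize_static` and `output_bytes_moved` are hypothetical names for illustration, not vLLM or FlashAttention functions, and rounding to the fp8 grid is omitted.

```python
FP8_E4M3_MAX = 448.0  # largest finite value representable in float8 e4m3fn


def quantize_static(values, scale):
    """Divide by a per-tensor scale and saturate to the e4m3 range.

    Rounding to the actual fp8 grid is intentionally left out of this sketch.
    """
    return [max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, v / scale)) for v in values]


def output_bytes_moved(num_elements, fused):
    """Rough global-memory traffic for the attention output only.

    Assumption: the unfused path writes a 2-byte output, then a separate
    quant kernel reads it back (2 bytes) and writes fp8 (1 byte).
    The fused path writes fp8 (1 byte) directly from the attention epilogue.
    """
    if fused:
        return num_elements * 1
    return num_elements * (2 + 2 + 1)


if __name__ == "__main__":
    print(quantize_static([1.0, -2.0, 1000.0], scale=0.5))
    print(output_bytes_moved(1024, fused=False), output_bytes_moved(1024, fused=True))
```

Under these assumptions the fused path moves one fifth of the output bytes and avoids one kernel launch; the end-to-end gain depends on how much of the total runtime that output traffic represents.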

vLLM already maintains its own fork of FA, so adding output quant support should be fairly non-intrusive. Subtasks:

  • vllm-flash-attn:

    • add output_scale parameter to attention forward functions
    • plumb parameter through all layers of the interface
    • compare runtime vs. compile-time branching on output_scale for performance and binary size (Hopper)
  • vllm:

    • integrate new FA version
    • add support for attention+quant fusion to FA attention backend
      • check FA version, hardware version
      • should be as easy as modifying the supports_fused_output_quant method and plumbing output_scale from FlashAttentionImpl.forward() to the kernel call
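The vllm-side subtasks above could be sketched roughly as follows. This is a standalone illustration, not vLLM code: `Env`, `stub_fa3_kernel`, and `forward` are hypothetical stand-ins, the `supports_fused_output_quant` signature is assumed (the issue only names the method), and the `"fp8_e4m3"` string key is made up for the example. The only hard fact relied on is that Hopper GPUs report CUDA compute capability 9.0.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional

HOPPER = (9, 0)  # CUDA compute capability of Hopper GPUs
FP8_E4M3_MAX = 448.0  # largest finite value in float8 e4m3fn


@dataclass(frozen=True)
class Env:
    """Hypothetical description of the runtime environment."""
    fa_version: int
    compute_capability: tuple


def supports_fused_output_quant(env: Env, quant_dtype: str) -> bool:
    # Gate on FA version and hardware version, as the subtask list suggests.
    return (
        env.fa_version == 3
        and env.compute_capability == HOPPER
        and quant_dtype == "fp8_e4m3"
    )


def stub_fa3_kernel(values, output_scale: Optional[float] = None):
    """Stand-in for the kernel call: identity 'attention' plus optional quant epilogue."""
    if output_scale is None:
        return list(values)  # unquantized output path
    return [max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, v / output_scale)) for v in values]


def forward(env: Env, values, output_scale: Optional[float] = None,
            quant_dtype: str = "fp8_e4m3"):
    """Plumb output_scale from the backend's forward() down to the kernel call."""
    if output_scale is not None and not supports_fused_output_quant(env, quant_dtype):
        raise NotImplementedError("fused output quant is not supported in this environment")
    return stub_fa3_kernel(values, output_scale=output_scale)
```

With `Env(fa_version=3, compute_capability=(9, 0))`, passing an `output_scale` reaches the kernel stub; with any other FA version or GPU the call fails early, which mirrors the idea that the fusion pass should only request fused output where the backend reports support.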

Additional context

cc @LucasWilkinson
