[Feature]: Add support for fused fp8 output to FlashAttention 3 · vllm-project/vllm#29920

(15 comments) (0 reactions) (1 assignee) Python (16,816 forks)

Labels: feature request, help wanted, keep-open, performance, torch.compile

Repository metrics

Stars: (80,034 stars)
PR merge metrics: (Avg merge 3d 17h) (993 merged PRs in 30d)

Description

🚀 The feature, motivation and pitch

On Hopper, we use FlashAttention as the default attention backend. When o-proj is quantized to fp8, we are leaving performance on the table because FA3 does not support fused fp8 output quantization. With the Triton/ROCm/AITER backends we saw up to 8% speedups from attention+quant fusion.

vLLM already maintains its own fork of FA, so adding output quant support should be pretty non-intrusive. Subtasks:

vllm-flash-attn:
- add output_scale parameter to attention forward functions
- plumb parameter through all layers of the interface
- compare branching at runtime/compile-time for performance and binary size (Hopper)
vllm:
- integrate new FA version
- add support for attention+quant fusion to FA attention backend
  - check FA version, hardware version
  - should be as easy as modifying the supports_fused_output_quant method and plumbing output_scale from FlashAttentionImpl.forward() to the kernel call
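
The vllm-side subtasks above can be sketched roughly as follows. This is a minimal, torch-free illustration, not vLLM or vllm-flash-attn code: `OutputQuant`, `can_fuse_output_quant`, `quantize_output_reference`, and `kernel_kwargs` are hypothetical names, and the gating rule (FA3, SM 9.0, static per-tensor fp8) is an assumption about what the check might look like. The reference function assumes the usual static-scale convention of dividing by the scale and saturating to the e4m3 range; rounding to actual fp8 values is omitted.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Sequence, Tuple

FP8_E4M3_MAX = 448.0  # largest finite magnitude of the e4m3fn format


@dataclass(frozen=True)
class OutputQuant:
    """Hypothetical description of the quantization wanted on the attention output."""
    dtype: str        # e.g. "fp8_e4m3"
    static: bool      # scale is known before the kernel runs
    per_tensor: bool  # a single scale for the whole output tensor


def can_fuse_output_quant(fa_version: int,
                          capability: Tuple[int, int],
                          quant: OutputQuant) -> bool:
    """Assumed gating rule: FA3 on Hopper (SM 9.0), static per-tensor fp8 only."""
    return (fa_version == 3
            and tuple(capability) == (9, 0)
            and quant.dtype == "fp8_e4m3"
            and quant.static
            and quant.per_tensor)


def quantize_output_reference(values: Sequence[float],
                              output_scale: float) -> List[float]:
    """Unfused reference: divide by the scale, then saturate to the fp8 range.

    A fused kernel would be expected to produce the same values (after fp8
    rounding, which is not modelled here) without a separate quant pass.
    """
    return [max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, v / output_scale))
            for v in values]


def kernel_kwargs(output_scale: Optional[float],
                  fa_version: int,
                  capability: Tuple[int, int],
                  quant: Optional[OutputQuant]) -> Dict[str, float]:
    """Toy plumbing step: forward output_scale only when fusion is allowed."""
    if output_scale is None or quant is None:
        return {}  # no fusion requested, kernel writes its usual output dtype
    if not can_fuse_output_quant(fa_version, capability, quant):
        raise ValueError("fused output quant requested but not supported")
    return {"output_scale": output_scale}
```

A fused FA3 path could then be validated against `quantize_output_reference` applied to the unfused attention output, which is the kind of comparison the runtime vs. compile-time branching experiment would also need.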

Additional context

cc @LucasWilkinson


Contributor guide

Research direction: Study the existing FlashAttention 3 code in the vllm-flash-attn fork, understand the forward function signatures, and add the `output_scale` parameter. Then modify the vLLM attention backend to pass this parameter when the conditions are met (e.g., o-proj is quantized to fp8).
Tech stack: Python
Domain: backendai
Issue type: Feature
Difficulty: 3
Estimated time: 3-5 days
Activity status: Active
Clarity: Clear
Prerequisites: Python, Git, CUDA
Newbie friendliness: 55