[Feature]: Add Triton implementation of NVFP4 GEMM
#21014 opened on Jul 15, 2025
Description
🚀 The feature, motivation and pitch
Currently, the only NVFP4 GEMMs we have are written in CUTLASS for SM100, so SM120 is unsupported. We still expect tuned CUTLASS kernels to provide the best performance, but a reference Triton implementation would be useful as a fallback when no other kernel is available.
Triton appears to have supported FP4 formats for several months, so the Triton version we already depend on should be new enough. See the scaled FP4 matmul test: https://github.com/triton-lang/triton/blob/620237edd282d3fa275e7f931af2018423108c4a/python/test/unit/language/test_matmul.py#L652
A good starting point would be to add the Triton kernel directly to https://github.com/vllm-project/vllm/blob/main/benchmarks/kernels/bench_nvfp4_gemm.py and compare its performance against the CUTLASS kernel before integrating it with vLLM as a whole.
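Before comparing performance, it may help to have a slow, hardware-independent reference to check a new kernel's output against. The following is only a rough sketch in plain Python, not how vLLM or CUTLASS implement it. It assumes the usual NVFP4 layout: 4-bit E2M1 values, two per byte, with one scale per block of 16 elements. The low-nibble-first packing order, the already-decoded float block scales (instead of raw FP8 E4M3 bytes), and the `global_scale` multiplier convention are assumptions, and all function names are hypothetical.

```python
# E2M1 magnitudes indexed by the low 3 bits of a 4-bit code; bit 3 is the sign.
E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK_SIZE = 16  # NVFP4 shares one scale across 16 consecutive elements


def decode_e2m1(code):
    """Decode one 4-bit E2M1 code (0..15) to a float."""
    magnitude = E2M1_MAGNITUDES[code & 0x7]
    return -magnitude if code & 0x8 else magnitude


def unpack_fp4(packed):
    """Split each byte into two 4-bit codes.

    Assumption: the low nibble holds the earlier element.
    """
    codes = []
    for byte in packed:
        codes.append(byte & 0xF)
        codes.append(byte >> 4)
    return codes


def dequantize_row(packed, block_scales, global_scale=1.0):
    """Dequantize one packed row of length K = 16 * len(block_scales).

    Assumption: block_scales are plain floats (already decoded from FP8),
    and global_scale is a per-tensor multiplier; real kernels may use a
    different convention.
    """
    codes = unpack_fp4(packed)
    assert len(codes) == BLOCK_SIZE * len(block_scales)
    return [
        decode_e2m1(code) * block_scales[i // BLOCK_SIZE] * global_scale
        for i, code in enumerate(codes)
    ]


def reference_gemm(a_rows, b_rows):
    """Compute A @ B^T for dequantized rows, with A as M x K and B as N x K."""
    return [
        [sum(x * y for x, y in zip(a_row, b_row)) for b_row in b_rows]
        for a_row in a_rows
    ]


# Tiny example: one row of 16 elements alternating 0.5 and 1.0, block scale 2.0.
row = dequantize_row(bytes([0x21] * 8), [2.0])
out = reference_gemm([row], [row])
```

A Triton or CUTLASS result for the same inputs could then be compared against `out` with a tolerance suited to the output dtype.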
Alternatives
No response
Additional context
No response