vllm-project/vllm

[Feature]: Add Triton implementation of NVFP4 GEMM

Open

#21,014 opened on Jul 15, 2025

Labels: feature request, good first issue, performance, unstale

Description

🚀 The feature, motivation and pitch

Currently, our only NVFP4 GEMMs are written in CUTLASS for SM100, which means we have no support for SM120. We still expect tuned CUTLASS kernels to provide the best performance, but it would be nice to have a reference Triton implementation to fall back on when no other kernel is available.

Triton seems to have supported FP4 formats for several months, so we should already have a new enough version of Triton: https://github.com/triton-lang/triton/blob/620237edd282d3fa275e7f931af2018423108c4a/python/test/unit/language/test_matmul.py#L652

A good starting point would be to add the Triton kernel directly to https://github.com/vllm-project/vllm/blob/main/benchmarks/kernels/bench_nvfp4_gemm.py and compare its performance to the CUTLASS kernel, before integrating it with vLLM as a whole.
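
For checking a new kernel's output before benchmarking it, a tiny CPU reference may help. The following is only a sketch in plain Python, not vLLM or Triton code. It assumes the usual NVFP4 layout: 4-bit E2M1 elements packed two per byte, with one scale shared by each block of 16 elements along K. The function names are hypothetical, the nibble order (low nibble first) is an assumption, block scales are passed as already-decoded floats rather than FP8 bytes, and the global scale is treated as a plain multiplier, which may not match the convention used by the CUTLASS path.

```python
# E2M1 magnitudes indexed by the low 3 bits of a 4-bit code; bit 3 is the sign.
E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK_SIZE = 16  # one scale is shared by 16 consecutive elements along K


def decode_e2m1(code):
    """Decode one 4-bit E2M1 code (0..15) to a Python float."""
    magnitude = E2M1_MAGNITUDES[code & 0x7]
    return -magnitude if code & 0x8 else magnitude


def unpack_row(packed, block_scales, global_scale=1.0):
    """Dequantize one row of packed FP4 bytes into a list of floats.

    Assumption: the low nibble of each byte holds the earlier element.
    Assumption: block_scales holds one already-decoded float per 16 elements,
    and global_scale is applied as a plain multiplier.
    """
    values = []
    for byte in packed:
        values.append(decode_e2m1(byte & 0xF))
        values.append(decode_e2m1(byte >> 4))
    return [
        v * block_scales[i // BLOCK_SIZE] * global_scale
        for i, v in enumerate(values)
    ]


def reference_nvfp4_gemm(a_packed, a_scales, b_packed, b_scales,
                         a_global=1.0, b_global=1.0):
    """Compute A @ B^T, where rows of A and rows of B are quantized along K."""
    a = [unpack_row(r, s, a_global) for r, s in zip(a_packed, a_scales)]
    b = [unpack_row(r, s, b_global) for r, s in zip(b_packed, b_scales)]
    return [[sum(x * y for x, y in zip(ra, rb)) for rb in b] for ra in a]


# One row each, K = 16 (8 packed bytes, one block scale per row).
a_row = bytes([0x21] * 8)  # alternating codes 1 and 2, i.e. 0.5 and 1.0
b_row = bytes([0x22] * 8)  # every code is 2, i.e. 1.0
result = reference_nvfp4_gemm([a_row], [[2.0]], [b_row], [[0.5]])
```

A real comparison would feed the same packed tensors and scales to the candidate kernel and check the outputs against this reference within a tolerance, since accumulation order and output dtype will differ.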

Alternatives

No response

Additional context

No response
