[Feature]: Add Triton implementation of NVFP4 GEMM · vllm-project/vllm#21014

Labels: feature request, good first issue, performance, unstale

Description

🚀 The feature, motivation and pitch

Currently our only NVFP4 GEMMs are written in CUTLASS for SM100, which means we have no support for SM120. While we still expect tuned CUTLASS kernels to provide the best performance, it would be nice to have a reference Triton implementation to fall back on when no other kernel is available.

It seems Triton has supported FP4 formats for several months, so the version we already use should be new enough: https://github.com/triton-lang/triton/blob/620237edd282d3fa275e7f931af2018423108c4a/python/test/unit/language/test_matmul.py#L652

A good starting point would be to add the Triton kernel directly to https://github.com/vllm-project/vllm/blob/main/benchmarks/kernels/bench_nvfp4_gemm.py and compare its performance to the CUTLASS kernel before integrating it with vLLM as a whole.
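Before comparing performance, it may help to have a slow numerical reference to check a candidate kernel against. The NumPy sketch below dequantizes packed FP4 (E2M1) values with one scale per 16-element block and then does a plain matmul. It is only an illustration, not vLLM's or Triton's API: the function names are hypothetical, the nibble order (low nibble first) is an assumption, the block scales are passed as already-decoded float32 rather than FP8 E4M3, and `global_scale` is treated as a dequantization multiplier, which may be the reciprocal of the convention vLLM actually uses.

```python
import numpy as np

# The 8 non-negative magnitudes representable in FP4 E2M1; bit 3 is the sign.
E2M1_MAGNITUDES = np.array(
    [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0], dtype=np.float32
)
BLOCK = 16  # number of elements sharing one block scale in NVFP4


def unpack_fp4(packed: np.ndarray) -> np.ndarray:
    """Decode uint8 [rows, K/2] into float32 [rows, K].

    Assumption: the low nibble of each byte holds the even-indexed element.
    """
    lo = packed & 0x0F
    hi = packed >> 4
    # Interleave (lo, hi) pairs back into element order along K.
    codes = np.stack([lo, hi], axis=-1).reshape(packed.shape[0], -1)
    sign = np.where(codes & 0x8, -1.0, 1.0).astype(np.float32)
    return sign * E2M1_MAGNITUDES[codes & 0x7]


def dequantize(packed, block_scales, global_scale):
    """block_scales is float32 [rows, K/16]; global_scale is a scalar multiplier."""
    vals = unpack_fp4(packed)                         # [rows, K]
    scales = np.repeat(block_scales, BLOCK, axis=1)   # [rows, K]
    return vals * scales * np.float32(global_scale)


def nvfp4_gemm_reference(a_packed, a_scales, a_global,
                         b_packed, b_scales, b_global):
    """A is [M, K] and B is [N, K] (both K-major); returns A @ B.T in float32."""
    a = dequantize(a_packed, a_scales, a_global)
    b = dequantize(b_packed, b_scales, b_global)
    return a @ b.T
```

A candidate kernel's output could then be compared to `nvfp4_gemm_reference` with a tolerance suited to the output dtype, on small shapes where the dense float32 matmul is cheap.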

Alternatives

No response

Additional context