[Bug][Help Wanted]: PPLX + vLLM CUTLASS FP8 Gives Incorrect Responses

Description

Your current environment

B200

Issue occurs on:

v0.13.0
main

I am not sure which is the earliest affected version.

🐛 Describe the bug

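I get incorrect results when combining the PPLX all2all backend with the vLLM CUTLASS FP8 path. The recipes and eval output that follow reproduce it: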

GPUS := "2"
PORT := "8004"
MODEL := "RedHatAI/Llama-4-Scout-17B-16E-Instruct-FP8-dynamic"

launch_dp_ep_pplx:
	VLLM_USE_FLASHINFER_MOE_FP4=0 \
	VLLM_USE_FLASHINFER_MOE_FP8=0 \
	VLLM_USE_DEEP_GEMM=0 \
	VLLM_FLASHINFER_MOE_BACKEND=throughput \
	chg run --gpus {{GPUS}} -- vllm serve {{MODEL}} -dp {{GPUS}} --enable-expert-parallel --port {{PORT}} --enforce-eager --all2all-backend pplx --max-model-len 8192


eval:
	lm_eval \
		--model local-completions \
		--tasks gsm8k \
		--model_args "model={{MODEL}},base_url=http://localhost:{{PORT}}/v1/completions,num_concurrent=1000,tokenized_requests=False" --limit 1000

local-completions (model=RedHatAI/Llama-4-Scout-17B-16E-Instruct-FP8-dynamic,base_url=http://localhost:8004/v1/completions,num_concurrent=1000,tokenized_requests=False), gen_kwargs: (None), limit: 1000.0, num_fewshot: None, batch_size: 1
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.307|±  |0.0146|
|     |       |strict-match    |     5|exact_match|↑  |0.280|±  |0.0142|

Expected accuracy is around 90%.
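To make the size of the gap concrete, here is a minimal sketch that puts the reported score and the expected 90% on the same scale. It assumes the Stderr column is the sample standard error of a proportion over the 1000 limited examples; that assumption is mine, not something stated in the report, and the helper names are hypothetical.

```python
import math


def proportion_stderr(p: float, n: int) -> float:
    """Sample standard error of a proportion p over n items (n - 1 denominator).

    Assumption: this is how the Stderr column in the table was computed.
    """
    return math.sqrt(p * (1.0 - p) / (n - 1))


def gap_in_stderrs(observed: float, expected: float, stderr: float) -> float:
    """How many standard errors the observed score sits below the expected one."""
    return (expected - observed) / stderr


observed = 0.307         # flexible-extract exact_match from the table
reported_stderr = 0.0146 # Stderr column from the table
expected = 0.90          # accuracy the report expects
n = 1000                 # --limit 1000

estimated_stderr = proportion_stderr(observed, n)
gap = gap_in_stderrs(observed, expected, reported_stderr)

print(f"estimated stderr: {estimated_stderr:.4f}")
print(f"gap: {gap:.1f} standard errors")
```

Under that assumption the estimated standard error matches the reported 0.0146, and the score sits roughly 40 standard errors below the expected value, so sampling noise from `--limit 1000` cannot plausibly explain the drop.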

Contributor guide

Research direction: Compare accuracy across backends (for example, drop `--all2all-backend pplx` and use the default all-to-all) to narrow down the source of the error. Also test a different FP8-quantized model to see whether the issue is model-specific. Consider enabling CUTLASS debug logs if they are available.
Tech stack: python
Domain: aibackend
Issue type: Bug
Difficulty: 4
Estimated time: 1-2 days
Activity status: Active
Clarity: Needs investigation
Prerequisites: Python, vLLM, GPU
Newbie friendliness: 20
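The backend comparison suggested in the research direction can be sketched as a small A/B client. This is only an illustration under stated assumptions: a second server is running the same model without `--all2all-backend pplx` on a hypothetical port 8005, both servers expose the OpenAI-compatible `/v1/completions` endpoint used in the eval recipe, and the function names here are made up for the example.

```python
import json
import urllib.request

MODEL = "RedHatAI/Llama-4-Scout-17B-16E-Instruct-FP8-dynamic"
PPLX_URL = "http://localhost:8004/v1/completions"      # server from the report
BASELINE_URL = "http://localhost:8005/v1/completions"  # hypothetical baseline server


def complete(url: str, prompt: str, max_tokens: int = 64) -> str:
    """Request one greedy (temperature 0) completion and return its text."""
    payload = json.dumps({
        "model": MODEL,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": 0.0,
    }).encode("utf-8")
    request = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=120) as response:
        body = json.load(response)
    return body["choices"][0]["text"]


def agreement_rate(a: list[str], b: list[str]) -> float:
    """Fraction of positions where the two outputs match after stripping whitespace."""
    if len(a) != len(b):
        raise ValueError("output lists must have the same length")
    if not a:
        return 0.0
    matches = sum(x.strip() == y.strip() for x, y in zip(a, b))
    return matches / len(a)


def compare_servers(prompts: list[str]) -> float:
    """Send the same prompts to both servers and report how often they agree."""
    pplx_outputs = [complete(PPLX_URL, p) for p in prompts]
    baseline_outputs = [complete(BASELINE_URL, p) for p in prompts]
    return agreement_rate(pplx_outputs, baseline_outputs)
```

Calling `compare_servers(["What is 12 * 7? Answer:"])` with both servers up returns the agreement fraction. Greedy outputs can differ slightly between correct backends because of numeric differences, so a low agreement rate is a hint to inspect the PPLX outputs by hand rather than proof of a bug on its own.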
