vllm-project/vllm
[Bug][Help Wanted]: PPLX + vLLM CUTLASS FP8 Gives Incorrect Responses
Open
#33011 opened on Jan 24, 2026
Labels: bug, help wanted, stale
Description
Your current environment
- B200
Issue occurs on:
- v0.13.0
- main
I am not sure which is the earliest affected version.
🐛 Describe the bug
I get incorrect results with PPLX + vLLM CUTLASS FP8. The recipes below launch the server and run the evaluation:
GPUS := "2"
PORT := "8004"
MODEL := "RedHatAI/Llama-4-Scout-17B-16E-Instruct-FP8-dynamic"

launch_dp_ep_pplx:
    VLLM_USE_FLASHINFER_MOE_FP4=0 \
    VLLM_USE_FLASHINFER_MOE_FP8=0 \
    VLLM_USE_DEEP_GEMM=0 \
    VLLM_FLASHINFER_MOE_BACKEND=throughput \
    chg run --gpus {{GPUS}} -- vllm serve {{MODEL}} -dp {{GPUS}} --enable-expert-parallel --port {{PORT}} --enforce-eager --all2all-backend pplx --max-model-len 8192

eval:
    lm_eval \
        --model local-completions \
        --tasks gsm8k \
        --model_args "model={{MODEL}},base_url=http://localhost:{{PORT}}/v1/completions,num_concurrent=1000,tokenized_requests=False" \
        --limit 1000
local-completions (model=RedHatAI/Llama-4-Scout-17B-16E-Instruct-FP8-dynamic,base_url=http://localhost:8004/v1/completions,num_concurrent=1000,tokenized_requests=False), gen_kwargs: (None), limit: 1000.0, num_fewshot: None, batch_size: 1
|Tasks|Version| Filter |n-shot| Metric | |Value| |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k| 3|flexible-extract| 5|exact_match|↑ |0.307|± |0.0146|
| | |strict-match | 5|exact_match|↑ |0.280|± |0.0142|
Expected: about 90% exact match.
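To show that the gap is far outside sampling noise, here is a minimal sketch of the arithmetic. It assumes each of the 1000 questions is scored 0 or 1 and that the reported Stderr is the usual sample standard error of the mean. The helper names are illustrative and not part of lm_eval.

```python
import math

n = 1000          # from --limit 1000 in the eval recipe
expected = 0.90   # accuracy the report expects

# (filter, reported exact_match, reported stderr) copied from the results table
reported = [
    ("flexible-extract", 0.307, 0.0146),
    ("strict-match", 0.280, 0.0142),
]


def binomial_stderr(p: float, n: int) -> float:
    """Standard error of the mean of n 0/1 outcomes (n - 1 denominator)."""
    return math.sqrt(p * (1 - p) / (n - 1))


def gap_in_stderrs(observed: float, stderr: float, expected: float) -> float:
    """How many standard errors the observed score sits below the expected one."""
    return (expected - observed) / stderr


for name, acc, se in reported:
    print(
        f"{name}: recomputed stderr={binomial_stderr(acc, n):.4f}, "
        f"gap={gap_in_stderrs(acc, se, expected):.1f} stderrs"
    )
```

Under these assumptions the recomputed standard errors match the table (0.0146 and 0.0142), and both scores fall more than 40 standard errors below 0.90, so the discrepancy points to a correctness problem and not to evaluation variance.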
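For anyone picking this up, one possible way to narrow the problem down is to send the same greedy request to the PPLX server and to a second server started with a configuration believed to be correct, then look for where the outputs diverge. The sketch below assumes the standard OpenAI-compatible `/v1/completions` request shape. The reference server on port 8005 and the helper names are hypothetical and do not come from this report.

```python
import json
import urllib.request

MODEL = "RedHatAI/Llama-4-Scout-17B-16E-Instruct-FP8-dynamic"
PPLX_URL = "http://localhost:8004/v1/completions"       # server from this report
REFERENCE_URL = "http://localhost:8005/v1/completions"  # hypothetical known-good server


def build_request(url: str, prompt: str, max_tokens: int = 64) -> urllib.request.Request:
    """Build a greedy (temperature 0) completion request."""
    payload = {
        "model": MODEL,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": 0,
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(url: str, prompt: str) -> str:
    """Send the request and return the generated text of the first choice."""
    with urllib.request.urlopen(build_request(url, prompt), timeout=120) as resp:
        return json.load(resp)["choices"][0]["text"]


def first_divergence(a: str, b: str) -> int:
    """Index of the first differing character, or -1 if the strings are equal."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    return -1 if len(a) == len(b) else min(len(a), len(b))


# Usage, with both servers running:
#   prompt = "Question: What is 12 * 7? Answer:"
#   idx = first_divergence(complete(PPLX_URL, prompt), complete(REFERENCE_URL, prompt))
#   print("identical" if idx == -1 else f"outputs diverge at character {idx}")
```

Small numerical differences between backends can change greedy output late in a long generation, so an early divergence on short prompts is the more telling signal.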