[Bug][Help Wanted]: PPLX + vLLM CUTLASS FP8 Gives Incorrect Responses · vllm-project/vllm#33011

Repository metrics

Stars: 80,034
PR merge metrics: average time to merge 9d 2h; 921 PRs merged in the last 30 days

Description

Your current environment

B200

Issue occurs on:

v0.13.0
main

I'm not sure which is the earliest affected version.

🐛 Describe the bug

I get incorrect results with the PPLX all2all backend + vLLM CUTLASS FP8. To reproduce:

GPUS := "2"
PORT := "8004"
MODEL := "RedHatAI/Llama-4-Scout-17B-16E-Instruct-FP8-dynamic"

launch_dp_ep_pplx:
	VLLM_USE_FLASHINFER_MOE_FP4=0 \
	VLLM_USE_FLASHINFER_MOE_FP8=0 \
	VLLM_USE_DEEP_GEMM=0 \
	VLLM_FLASHINFER_MOE_BACKEND=throughput \
	chg run --gpus {{GPUS}} -- vllm serve {{MODEL}} -dp {{GPUS}} --enable-expert-parallel --port {{PORT}} --enforce-eager --all2all-backend pplx --max-model-len 8192


eval:
	lm_eval \
		--model local-completions \
		--tasks gsm8k \
		--model_args "model={{MODEL}},base_url=http://localhost:{{PORT}}/v1/completions,num_concurrent=1000,tokenized_requests=False" --limit 1000

local-completions (model=RedHatAI/Llama-4-Scout-17B-16E-Instruct-FP8-dynamic,base_url=http://localhost:8004/v1/completions,num_concurrent=1000,tokenized_requests=False), gen_kwargs: (None), limit: 1000.0, num_fewshot: None, batch_size: 1
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.307|±  |0.0146|
|     |       |strict-match    |     5|exact_match|↑  |0.280|±  |0.0142|

Expected: about 90% on gsm8k.
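
To put the gap in perspective, here is a minimal sketch that checks whether the shortfall could plausibly be sampling noise. It uses only the numbers from the table above (0.307 flexible-extract accuracy, stderr 0.0146, 1000 samples) and the roughly 90% expectation. The helper names are made up for illustration, and the binomial formula is an assumption about how the reported stderr was derived.

```python
import math


def binomial_stderr(p: float, n: int) -> float:
    """Standard error of a proportion p estimated from n independent samples."""
    return math.sqrt(p * (1.0 - p) / n)


def gap_in_stderrs(observed: float, expected: float, stderr: float) -> float:
    """How many standard errors the observed score sits below the expected one."""
    return (expected - observed) / stderr


observed = 0.307         # flexible-extract exact_match from the table
expected = 0.90          # expected accuracy stated in the issue
reported_stderr = 0.0146
n = 1000                 # --limit 1000

# The binomial estimate should land close to the stderr lm_eval reported.
print(f"binomial stderr: {binomial_stderr(observed, n):.4f}")
print(f"gap: {gap_in_stderrs(observed, expected, reported_stderr):.1f} standard errors")
```

A gap of dozens of standard errors points to a systematic correctness problem rather than run-to-run variance.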

Before submitting a new issue...

Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.

Contributor guide

Research direction: Compare accuracy across backends (for example, disable PPLX and use the default all-to-all) to narrow down the source of the error. Also test with a different FP8-quantized model to see whether the problem is model-specific. If possible, enable CUTLASS debug logs.
Tech stack: python
Domain: aibackend
Issue type: Bug
Difficulty: 4
Estimated time: 1-2 days
Activity status: Active
Clarity: Needs investigation
Prerequisites: Python, vLLM, GPU
Newcomer friendliness: 20
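
One way to act on the backend comparison suggested above is to send identical prompts to two running servers and see where their outputs diverge. The sketch below assumes two OpenAI-compatible `/v1/completions` endpoints: the PPLX server from the issue on port 8004, and a second, hypothetical server on port 8005 launched with a different all2all backend. The function names and the second port are illustrative and do not come from the issue.

```python
import json
import urllib.request

MODEL = "RedHatAI/Llama-4-Scout-17B-16E-Instruct-FP8-dynamic"


def build_payload(prompt: str, model: str = MODEL, max_tokens: int = 64) -> dict:
    """Request body for /v1/completions; temperature 0 requests greedy decoding."""
    return {"model": model, "prompt": prompt, "max_tokens": max_tokens, "temperature": 0}


def complete(url: str, prompt: str) -> str:
    """POST one prompt to a completions endpoint and return the generated text."""
    req = urllib.request.Request(
        url,
        data=json.dumps(build_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=120) as resp:
        return json.load(resp)["choices"][0]["text"]


def first_divergence(a: str, b: str) -> int:
    """Index of the first differing character, or -1 if the strings are identical."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    return -1 if len(a) == len(b) else min(len(a), len(b))


def compare_servers(url_a: str, url_b: str, prompts: list) -> list:
    """Return (prompt, divergence index, output_a, output_b) for each prompt."""
    rows = []
    for prompt in prompts:
        out_a, out_b = complete(url_a, prompt), complete(url_b, prompt)
        rows.append((prompt, first_divergence(out_a, out_b), out_a, out_b))
    return rows
```

With both servers up, it could be called as `compare_servers("http://localhost:8004/v1/completions", "http://localhost:8005/v1/completions", ["What is 12 * 7?"])`. Small differences late in a completion can come from ordinary numerical variation between kernels, so early divergence or garbled text is the more telling signal.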
