[Bug][Help Wanted]: PPLX + vLLM CUTLASS FP8 Gives Incorrect Responses · vllm-project/vllm#33011

6 comments · 0 reactions · 0 assignees · Python · 16,816 forks

bug, help wanted, stale

Repository metrics

Stars: 80,034
PR merge metrics: average merge time 9d 2h; 921 PRs merged in 30 days

Description

Your current environment

B200

Issue occurs on:

v0.13.0
main

I'm not sure which is the earliest affected version.

🐛 Describe the bug

I get incorrect results with PPLX + vLLM CUTLASS FP8. To reproduce:

GPUS := "2"
PORT := "8004"
MODEL := "RedHatAI/Llama-4-Scout-17B-16E-Instruct-FP8-dynamic"

launch_dp_ep_pplx:
	VLLM_USE_FLASHINFER_MOE_FP4=0 \
	VLLM_USE_FLASHINFER_MOE_FP8=0 \
	VLLM_USE_DEEP_GEMM=0 \
	VLLM_FLASHINFER_MOE_BACKEND=throughput \
	chg run --gpus {{GPUS}} -- vllm serve {{MODEL}} -dp {{GPUS}} --enable-expert-parallel --port {{PORT}} --enforce-eager --all2all-backend pplx --max-model-len 8192


eval:
	lm_eval \
		--model local-completions \
		--tasks gsm8k \
		--model_args "model={{MODEL}},base_url=http://localhost:{{PORT}}/v1/completions,num_concurrent=1000,tokenized_requests=False" --limit 1000

local-completions (model=RedHatAI/Llama-4-Scout-17B-16E-Instruct-FP8-dynamic,base_url=http://localhost:8004/v1/completions,num_concurrent=1000,tokenized_requests=False), gen_kwargs: (None), limit: 1000.0, num_fewshot: None, batch_size: 1
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.307|±  |0.0146|
|     |       |strict-match    |     5|exact_match|↑  |0.280|±  |0.0142|

Expected: around 90%.
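
To show that the gap is far outside sampling noise, here is a minimal sketch that recomputes the standard error from the reported scores and expresses the shortfall in units of it. It assumes the plain binomial formula sqrt(p(1-p)/n) with n = 1000 (from `--limit 1000`) and treats 0.90 as the rough expectation stated above; the helper names are illustrative, not part of lm_eval.

```python
import math


def binomial_stderr(p: float, n: int) -> float:
    """Standard error of the mean of n independent 0/1 outcomes with mean p."""
    return math.sqrt(p * (1.0 - p) / n)


def gap_in_stderrs(observed: float, expected: float, stderr: float) -> float:
    """How many standard errors the observed score sits below the expected one."""
    return (expected - observed) / stderr


N = 1000         # from --limit 1000
EXPECTED = 0.90  # rough expectation from the report, not a measured baseline

# (filter name, reported exact_match, reported stderr) copied from the table above
rows = [
    ("flexible-extract", 0.307, 0.0146),
    ("strict-match", 0.280, 0.0142),
]

for name, observed, reported_stderr in rows:
    recomputed = binomial_stderr(observed, N)
    gap = gap_in_stderrs(observed, EXPECTED, reported_stderr)
    print(f"{name}: stderr reported={reported_stderr:.4f} "
          f"recomputed={recomputed:.4f}, gap={gap:.1f} stderrs")
```

Both filters land roughly 40 standard errors below the expected score, so this looks like a correctness problem rather than run-to-run variance.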

Before submitting a new issue...

Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.

Contributor guide

Research direction: Compare accuracy across different backends (for example, disable PPLX and use the default all-to-all) to narrow down the source of the error. Also test with a different FP8-quantized model to see whether the problem is model-specific. If possible, enable CUTLASS debug logging.
Tech stack: Python
Domain: AI, backend
Issue type: Bug
Difficulty: 4
Estimated time: 1-2 days
Activity status: Active
Clarity: Needs investigation
Prerequisites: Python, vLLM, GPU
Beginner-friendly: 20
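
One way to follow the backend comparison suggested in the research direction is to run two servers side by side and diff greedy completions for the same prompts. The sketch below assumes a second `vllm serve` instance launched without `--all2all-backend pplx` on a hypothetical port 8005; `complete`, `first_divergence`, and `compare` are illustrative helpers, not vLLM APIs. Small numerical differences between backends can flip a token even when both are correct, so an early divergence is a hint, not proof.

```python
import json
import urllib.request


def complete(url: str, model: str, prompt: str, max_tokens: int = 64) -> str:
    """Request a greedy completion from an OpenAI-compatible /v1/completions URL."""
    payload = {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": 0.0,  # greedy decoding so the two servers are comparable
    }
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=120) as resp:
        body = json.load(resp)
    return body["choices"][0]["text"]


def first_divergence(a: str, b: str) -> int:
    """Index of the first differing character, or -1 if the strings are identical."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    return -1 if len(a) == len(b) else min(len(a), len(b))


def compare(prompts, url_a: str, url_b: str, model: str) -> None:
    """Print, per prompt, where the two servers' completions start to differ."""
    for prompt in prompts:
        out_a = complete(url_a, model, prompt)
        out_b = complete(url_b, model, prompt)
        idx = first_divergence(out_a, out_b)
        status = "identical" if idx == -1 else f"diverges at char {idx}"
        print(f"{prompt[:40]!r}: {status}")
```

Usage, with the PPLX server from the report on port 8004 and the hypothetical reference server on 8005: `compare(["What is 17 * 23?"], "http://localhost:8004/v1/completions", "http://localhost:8005/v1/completions", "RedHatAI/Llama-4-Scout-17B-16E-Instruct-FP8-dynamic")`.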
