[Bug]: 0% Acceptance rate with FI Cutlass DeepSeekR1 NVFP4 with mtp ep


bug, help wanted, stale


Description

Your current environment

b200

🐛 Describe the bug

launch_cutlass_moe_trtllm_attn_fused_ar_rope_fp8_kv_ep_spec_decode:
	VLLM_USE_FLASHINFER_MOE_FP4=1 VLLM_FLASHINFER_MOE_BACKEND="throughput" CUDA_VISIBLE_DEVICES=1,3,0,5 vllm serve {{MODEL}} -tp {{GPUS}} --port {{PORT}} \
		--attention-config.use_trtllm_ragged_deepseek_prefill=True --attention-backend FLASHINFER_MLA \
		--compilation_config.pass_config.fuse_allreduce_rms true \
		--compilation_config.custom_ops+=+rotary_embedding \
		--kv-cache-dtype fp8 --compilation_config.pass_config.fuse_attn_quant true \
		--enable-expert-parallel --speculative-config '{"num_speculative_tokens": 3, "method":"deepseek_mtp"}'

(APIServer pid=2660027) INFO 01-08 23:07:23 [metrics.py:100] SpecDecoding metrics: Mean acceptance length: 1.02, Accepted throughput: 10.00 tokens/s, Drafted throughput: 1363.65 tokens/s, Accepted: 100 tokens, Drafted: 13638 tokens, Per-position acceptance rate: 0.022, 0.000, 0.000, Avg Draft acceptance rate: 0.7%
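As a sanity check on the log line above, the summary figures can be recomputed from the raw token counts. This is a minimal sketch, not vLLM's own code: it assumes that every draft round proposes exactly `num_speculative_tokens` tokens and that the mean acceptance length is one (the token the target model always emits) plus the accepted tokens per draft round.

```python
import re

# Metrics text copied from the log line in this report.
LOG_LINE = (
    "SpecDecoding metrics: Mean acceptance length: 1.02, "
    "Accepted throughput: 10.00 tokens/s, Drafted throughput: 1363.65 tokens/s, "
    "Accepted: 100 tokens, Drafted: 13638 tokens, "
    "Per-position acceptance rate: 0.022, 0.000, 0.000, "
    "Avg Draft acceptance rate: 0.7%"
)

NUM_SPECULATIVE_TOKENS = 3  # from --speculative-config in the launch command

accepted = int(re.search(r"Accepted: (\d+) tokens", LOG_LINE).group(1))
drafted = int(re.search(r"Drafted: (\d+) tokens", LOG_LINE).group(1))

# Assumption: each draft round proposes exactly NUM_SPECULATIVE_TOKENS tokens.
num_drafts = drafted / NUM_SPECULATIVE_TOKENS

draft_acceptance_rate = accepted / drafted

# Assumption: the target model always contributes one token per round,
# so the mean acceptance length is 1 plus accepted tokens per round.
mean_acceptance_length = 1 + accepted / num_drafts

print(f"draft rounds: {num_drafts:.0f}")
print(f"avg draft acceptance rate: {draft_acceptance_rate:.1%}")
print(f"mean acceptance length: {mean_acceptance_length:.2f}")
```

Under these assumptions the script reproduces the logged 0.7% and 1.02. Accepted tokens per draft round (100 / 4546) also rounds to the logged first-position rate of 0.022, so essentially only the first draft token is ever accepted, and only about 2% of the time.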

It works fine with either:

- VLLM_FLASHINFER_MOE_BACKEND="latency"
- no --enable-expert-parallel
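For anyone trying to reproduce this, the failing configuration and the two working ones can be generated side by side. The sketch below only rearranges the environment variables and flags already shown in the launch recipe; `build_command` is a hypothetical helper, and `{{MODEL}}`, `{{GPUS}}` and `{{PORT}}` are the recipe's own placeholders, left unfilled.

```python
import shlex

# Values copied from the launch recipe in this report.
BASE_ENV = {
    "VLLM_USE_FLASHINFER_MOE_FP4": "1",
    "VLLM_FLASHINFER_MOE_BACKEND": "throughput",
    "CUDA_VISIBLE_DEVICES": "1,3,0,5",
}
BASE_ARGS = [
    "vllm", "serve", "{{MODEL}}", "-tp", "{{GPUS}}", "--port", "{{PORT}}",
    "--attention-config.use_trtllm_ragged_deepseek_prefill=True",
    "--attention-backend", "FLASHINFER_MLA",
    "--compilation_config.pass_config.fuse_allreduce_rms", "true",
    "--compilation_config.custom_ops+=+rotary_embedding",
    "--kv-cache-dtype", "fp8",
    "--compilation_config.pass_config.fuse_attn_quant", "true",
    "--enable-expert-parallel",
    "--speculative-config",
    '{"num_speculative_tokens": 3, "method":"deepseek_mtp"}',
]


def build_command(env, args):
    """Hypothetical helper: render env vars and args as one shell line."""
    env_part = " ".join(f"{key}={shlex.quote(value)}" for key, value in env.items())
    return f"{env_part} {shlex.join(args)}"


variants = {
    # Reported as broken: throughput backend with expert parallelism.
    "failing": build_command(BASE_ENV, BASE_ARGS),
    # Reported as working: same flags, latency backend.
    "latency_backend": build_command(
        {**BASE_ENV, "VLLM_FLASHINFER_MOE_BACKEND": "latency"}, BASE_ARGS
    ),
    # Reported as working: throughput backend without expert parallelism.
    "no_expert_parallel": build_command(
        BASE_ENV, [arg for arg in BASE_ARGS if arg != "--enable-expert-parallel"]
    ),
}

for name, command in variants.items():
    print(f"{name}:\n  {command}\n")
```

Running the three variants against the same prompts and comparing the SpecDecoding metrics line should show whether the throughput backend combined with expert parallelism is the only failing combination.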

Before submitting a new issue...

Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
