vllm-project/vllm
[Bug]: 0% Acceptance rate with FI Cutlass DeepSeekR1 NVFP4 with mtp ep
Closed
#32009 opened on Jan 9, 2026
Labels: bug, help wanted, stale
Description
Your current environment
b200
🐛 Describe the bug
launch_cutlass_moe_trtllm_attn_fused_ar_rope_fp8_kv_ep_spec_decode:
VLLM_USE_FLASHINFER_MOE_FP4=1 VLLM_FLASHINFER_MOE_BACKEND="throughput" CUDA_VISIBLE_DEVICES=1,3,0,5 vllm serve {{MODEL}} -tp {{GPUS}} --port {{PORT}} \
--attention-config.use_trtllm_ragged_deepseek_prefill=True --attention-backend FLASHINFER_MLA \
--compilation_config.pass_config.fuse_allreduce_rms true \
--compilation_config.custom_ops+=+rotary_embedding \
--kv-cache-dtype fp8 --compilation_config.pass_config.fuse_attn_quant true \
--enable-expert-parallel --speculative-config '{"num_speculative_tokens": 3, "method":"deepseek_mtp"}'
(APIServer pid=2660027) INFO 01-08 23:07:23 [metrics.py:100] SpecDecoding metrics: Mean acceptance length: 1.02, Accepted throughput: 10.00 tokens/s, Drafted throughput: 1363.65 tokens/s, Accepted: 100 tokens, Drafted: 13638 tokens, Per-position acceptance rate: 0.022, 0.000, 0.000, Avg Draft acceptance rate: 0.7%
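As a sanity check on the metrics line above, here is a minimal sketch that recomputes the headline numbers from the raw counts. It assumes that every draft step proposes exactly `num_speculative_tokens` tokens and that mean acceptance length is `1 + accepted / number_of_drafts`; both are assumptions about how the metric is defined, not something confirmed in this issue. The helper `summarize` is hypothetical and not part of vLLM.

```python
import re

# The SpecDecoding metrics line from the log above, copied verbatim.
LOG_LINE = (
    "SpecDecoding metrics: Mean acceptance length: 1.02, "
    "Accepted throughput: 10.00 tokens/s, Drafted throughput: 1363.65 tokens/s, "
    "Accepted: 100 tokens, Drafted: 13638 tokens, "
    "Per-position acceptance rate: 0.022, 0.000, 0.000, "
    "Avg Draft acceptance rate: 0.7%"
)


def summarize(line: str, num_speculative_tokens: int) -> dict:
    """Recompute acceptance statistics from the raw token counts in a log line."""
    accepted = int(re.search(r"Accepted: (\d+) tokens", line).group(1))
    drafted = int(re.search(r"Drafted: (\d+) tokens", line).group(1))
    # Assumption: each draft step proposes exactly num_speculative_tokens tokens.
    num_drafts = drafted / num_speculative_tokens
    return {
        "num_drafts": num_drafts,
        "acceptance_rate_pct": 100.0 * accepted / drafted,
        # Assumption: one token from the target model plus the accepted draft tokens.
        "mean_acceptance_length": 1.0 + accepted / num_drafts,
    }


stats = summarize(LOG_LINE, num_speculative_tokens=3)
print(stats)
```

Under those assumptions the recomputed values round to 0.7% and 1.02, which agrees with the log. The per-position rates (0.022, 0.000, 0.000) indicate that the few accepted tokens all come from the first draft position.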
It works fine with either:
- VLLM_FLASHINFER_MOE_BACKEND="latency"
- no --enable-expert-parallel