vllm-project/vllm

[Performance]: qknorm+rope fusion slower than unfused on H100

Open

#34.391 aberto em 12 de fev. de 2026

Ver no GitHub
 (12 comments) (1 reaction) (1 assignee)Python (16.816 forks)batch import
help wantedperformancetorch.compile

Métricas do repositório

Stars
 (80.034 stars)
Métricas de merge de PR
 (Mesclagem média 9d 2h) (921 fundiu PRs em 30d)

Description

Proposal to improve performance

Running vllm bench sweep serve for -cc.pass_config.enable_qknorm_rope_fusion in {True, False} gives the following results:

# base command
vllm bench sweep serve \
    --serve-cmd "vllm serve --model $MODEL --tensor-parallel-size $TP --port $PORT --no-enable-prefix-caching" \
    --bench-cmd "vllm bench serve --dataset-name random --ignore-eos --model=$MODEL --port $PORT" \
    --bench-params sweep-qps.json \
    --serve-params sweep-qknorm-fusion.json \

# sweep-qps.json
[
  {
    "num-prompts": 120,
    "request-rate": 1
  },{
    "num-prompts": 600,
    "request-rate": 5
  },{
    "num-prompts": 1200,
    "request-rate": 10
  },{
    "num-prompts": 1800,
    "request-rate": 15
  },{
    "num-prompts": 2400,
    "request-rate": 20
  },{
    "num-prompts": 1000,
    "request-rate": "inf"
  }
]

# sweep-qknorm-fusion.json
{
  "fused": {
    "compilation_config": {
      "pass_config": {
        "enable_qknorm_rope_fusion": True
      }
    }
  },
  "unfused": {
    "compilation_config": {
      "pass_config": {
        "enable_qknorm_rope_fusion": False
      }
    }
  }
}

Report of performance regression

No response

Misc discussion on performance

No response

Your current environment (if you think it is necessary)

The output of `python collect_env.py`

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.

Guia do colaborador