[Performance]: qknorm+rope fusion slower than unfused on H100 · vllm-project/vllm#34391

Repository metrics

Stars: 80,034
PR merge metrics: average merge time 3d 17h; 993 merged PRs in 30d

Description

Proposal to improve performance

Running `vllm bench sweep serve` with `-cc.pass_config.enable_qknorm_rope_fusion` in {True, False} gives the following results:

# base command
vllm bench sweep serve \
    --serve-cmd "vllm serve --model $MODEL --tensor-parallel-size $TP --port $PORT --no-enable-prefix-caching" \
    --bench-cmd "vllm bench serve --dataset-name random --ignore-eos --model=$MODEL --port $PORT" \
    --bench-params sweep-qps.json \
    --serve-params sweep-qknorm-fusion.json

# sweep-qps.json
[
  {
    "num-prompts": 120,
    "request-rate": 1
  },{
    "num-prompts": 600,
    "request-rate": 5
  },{
    "num-prompts": 1200,
    "request-rate": 10
  },{
    "num-prompts": 1800,
    "request-rate": 15
  },{
    "num-prompts": 2400,
    "request-rate": 20
  },{
    "num-prompts": 1000,
    "request-rate": "inf"
  }
]

# sweep-qknorm-fusion.json
{
  "fused": {
    "compilation_config": {
      "pass_config": {
        "enable_qknorm_rope_fusion": true
      }
    }
  },
  "unfused": {
    "compilation_config": {
      "pass_config": {
        "enable_qknorm_rope_fusion": false
      }
    }
  }
}
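The two sweep files above can also be generated rather than written by hand. The following is a minimal sketch, not part of the original report: the helper names `build_qps_sweep` and `build_fusion_sweep` and the `WINDOW_SECONDS` constant are hypothetical, and the 120-second window is an assumption inferred from the `num-prompts` values (each finite-rate entry equals 120 times its `request-rate`). Going through `json.dumps` also guarantees JSON-style lowercase booleans.

```python
import json

# Assumption: each finite-rate run submits about 120 seconds of traffic,
# which matches the num-prompts values in sweep-qps.json (120 * rate).
WINDOW_SECONDS = 120


def build_qps_sweep(rates, inf_prompts=1000):
    """Return the bench-params records for the given request rates."""
    records = [
        {"num-prompts": WINDOW_SECONDS * rate, "request-rate": rate}
        for rate in rates
    ]
    # The unthrottled run uses a fixed prompt count, as in the issue.
    records.append({"num-prompts": inf_prompts, "request-rate": "inf"})
    return records


def build_fusion_sweep():
    """Return the serve-params mapping that toggles the fusion pass."""
    def entry(enabled):
        return {
            "compilation_config": {
                "pass_config": {"enable_qknorm_rope_fusion": enabled}
            }
        }
    return {"fused": entry(True), "unfused": entry(False)}


# json.dumps turns Python's True/False into JSON's true/false.
qps_json = json.dumps(build_qps_sweep([1, 5, 10, 15, 20]), indent=2)
fusion_json = json.dumps(build_fusion_sweep(), indent=2)
print(qps_json)
print(fusion_json)
```

Redirecting each printed document into `sweep-qps.json` and `sweep-qknorm-fusion.json` would reproduce the inputs to the base command; adding a rate to the list extends the sweep without recomputing prompt counts by hand.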

Report of performance regression

No response

Misc discussion on performance

No response

Your current environment (if you think it is necessary)

The output of `python collect_env.py`

Before submitting a new issue...

Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.

Contributor guide

Research direction: Profile the fused QK norm and RoPE kernel on H100 to compare its performance with the unfused version. Identify the bottleneck (e.g., memory access pattern, compute utilization) and consider whether the fusion pass needs optimization or should be disabled for certain GPUs.
Tech stack: python
Domain: AI, machine learning, performance
Issue type: Performance
Difficulty: 4
Estimated time: 3-5 days
Activity status: Active
Clarity: Clear
Prerequisites: Python, GPU computing concepts, vLLM internals
Newbie friendliness: 25
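When profiling the fused kernel as suggested under Research direction, it may help to have a small reference for what the unfused path computes, so that any kernel change can be checked for correctness before its speed is compared. The sketch below is not vLLM's implementation. It assumes RMSNorm applied per head followed by rotate-half (NeoX-style) RoPE, and the function names, `eps`, and `base` values are illustrative assumptions.

```python
import math


def rms_norm(x, weight, eps=1e-6):
    """RMSNorm over one head vector: x / sqrt(mean(x^2) + eps) * weight."""
    mean_sq = sum(v * v for v in x) / len(x)
    scale = 1.0 / math.sqrt(mean_sq + eps)
    return [v * scale * w for v, w in zip(x, weight)]


def rope_neox(x, position, base=10000.0):
    """Rotate-half (NeoX-style) RoPE on one head vector of even length."""
    half = len(x) // 2
    out = [0.0] * len(x)
    for i in range(half):
        # Rotation angle for pair i: position * base^(-2i / head_dim).
        angle = position * base ** (-2.0 * i / len(x))
        c, s = math.cos(angle), math.sin(angle)
        x1, x2 = x[i], x[i + half]
        out[i] = x1 * c - x2 * s
        out[i + half] = x2 * c + x1 * s
    return out


def qk_norm_rope_reference(q, k, q_weight, k_weight, position):
    """Unfused reference: normalize q and k, then rotate both."""
    q_out = rope_neox(rms_norm(q, q_weight), position)
    k_out = rope_neox(rms_norm(k, k_weight), position)
    return q_out, k_out
```

A fused kernel reads q and k once and applies both steps before writing back, whereas the unfused path above makes two passes over each vector. Comparing a kernel's output against this reference on a few head vectors is a cheap sanity check before looking at memory traffic or occupancy in a profiler.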
