[Performance]: qknorm+rope fusion slower than unfused on H100 · vllm-project/vllm#34391

Repository metrics

Stars: 80,034
PR merge metrics: average time to merge 9 d 2 h; 921 PRs merged in 30 days

Description

Proposal to improve performance

Running `vllm bench sweep serve` with `-cc.pass_config.enable_qknorm_rope_fusion` in {True, False} gives the following results:

# base command
vllm bench sweep serve \
    --serve-cmd "vllm serve --model $MODEL --tensor-parallel-size $TP --port $PORT --no-enable-prefix-caching" \
    --bench-cmd "vllm bench serve --dataset-name random --ignore-eos --model=$MODEL --port $PORT" \
    --bench-params sweep-qps.json \
    --serve-params sweep-qknorm-fusion.json

# sweep-qps.json
[
  {
    "num-prompts": 120,
    "request-rate": 1
  },{
    "num-prompts": 600,
    "request-rate": 5
  },{
    "num-prompts": 1200,
    "request-rate": 10
  },{
    "num-prompts": 1800,
    "request-rate": 15
  },{
    "num-prompts": 2400,
    "request-rate": 20
  },{
    "num-prompts": 1000,
    "request-rate": "inf"
  }
]

# sweep-qknorm-fusion.json
{
  "fused": {
    "compilation_config": {
      "pass_config": {
        "enable_qknorm_rope_fusion": true
      }
    }
  },
  "unfused": {
    "compilation_config": {
      "pass_config": {
        "enable_qknorm_rope_fusion": false
      }
    }
  }
}
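
Both parameter files can also be generated rather than written by hand. The following is a minimal Python sketch, not part of the original report. It relies on the observation that every finite-rate entry in sweep-qps.json uses 120 prompts per unit of request rate. The function names and the `prompts_per_rate` parameter are made up for illustration. Going through `json.dumps` also guarantees lowercase JSON booleans.

```python
import json
from pathlib import Path


def qps_sweep(rates, prompts_per_rate=120, inf_prompts=1000):
    """Build the benchmark sweep: one entry per finite rate, plus an unthrottled run."""
    params = [
        {"num-prompts": rate * prompts_per_rate, "request-rate": rate}
        for rate in rates
    ]
    # The unthrottled run uses a fixed prompt count, as in the report.
    params.append({"num-prompts": inf_prompts, "request-rate": "inf"})
    return params


def fusion_sweep():
    """Build the serve sweep: the same pass_config flag, once on and once off."""
    return {
        name: {
            "compilation_config": {
                "pass_config": {"enable_qknorm_rope_fusion": enabled}
            }
        }
        for name, enabled in (("fused", True), ("unfused", False))
    }


def to_json(obj):
    """Serialize with JSON booleans (true/false), not Python's True/False."""
    return json.dumps(obj, indent=2)


def write_sweep_files(directory="."):
    """Write both files under the names used in the base command."""
    out = Path(directory)
    (out / "sweep-qps.json").write_text(to_json(qps_sweep([1, 5, 10, 15, 20])))
    (out / "sweep-qknorm-fusion.json").write_text(to_json(fusion_sweep()))
```

Calling `write_sweep_files()` before the base command should reproduce the two files shown above, assuming the sweep tool reads them as plain JSON.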

Report of performance regression

No response

Misc discussion on performance

No response

Your current environment (if you think it is necessary)

The output of `python collect_env.py`

Before submitting a new issue...

Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.

Contributor Guide

Research direction: Profile the fused QK norm and RoPE kernel on H100 and compare its performance with the unfused version. Identify the bottleneck (e.g. memory access patterns, compute utilization) and consider whether the fusion needs to be optimized or disabled for certain GPUs.
Tech stack: Python
Domain: AI, machine learning, performance
Issue type: Performance
Difficulty: 4
Estimated time: 3-5 days
Activity status: Active
Clarity: Clear
Prerequisites: Python, GPU computing concepts, vLLM internals
Beginner-friendliness: 25
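
As a starting point for the research direction above, the fused and unfused code paths can be timed in isolation before reaching for a full profiler. The sketch below is a generic harness using only the standard library. `time_callable`, `compare`, and the `sync` parameter are hypothetical names, not vLLM APIs. On a GPU, `sync` would need to be a device synchronization function (for example `torch.cuda.synchronize`), because kernel launches return before the work finishes.

```python
import statistics
import time


def time_callable(fn, *, warmup=10, iters=100, sync=None):
    """Return the median wall-clock seconds per call of fn.

    sync, if given, is called after warmup and after every timed call so
    that asynchronous device work is included in the measurement.
    """
    for _ in range(warmup):
        fn()
    if sync is not None:
        sync()

    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        if sync is not None:
            sync()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def compare(fused_fn, unfused_fn, **timing_kwargs):
    """Time both variants; speedup > 1 means the fused variant is faster."""
    fused_s = time_callable(fused_fn, **timing_kwargs)
    unfused_s = time_callable(unfused_fn, **timing_kwargs)
    speedup = unfused_s / fused_s if fused_s > 0 else float("inf")
    return {"fused_s": fused_s, "unfused_s": unfused_s, "speedup": speedup}
```

A `speedup` below 1 on H100 would match the behavior described in the issue title. A kernel-level profiler is still needed to attribute the gap to memory access patterns or compute utilization.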
