[Performance]: qknorm+rope fusion slower than unfused on H100 · vllm-project/vllm#34391

(12 comments) (1 reaction) (1 assignee) Python (16,816 forks)

help wanted, performance, torch.compile

Repository metrics

Stars: (80,034 stars)
PR merge metrics: (Average merge time 9d 2h) (921 PRs merged in 30 days)

Description

Proposal to improve performance

Running vllm bench sweep serve for -cc.pass_config.enable_qknorm_rope_fusion in {True, False} gives the following results:

# base command
vllm bench sweep serve \
    --serve-cmd "vllm serve --model $MODEL --tensor-parallel-size $TP --port $PORT --no-enable-prefix-caching" \
    --bench-cmd "vllm bench serve --dataset-name random --ignore-eos --model=$MODEL --port $PORT" \
    --bench-params sweep-qps.json \
    --serve-params sweep-qknorm-fusion.json

# sweep-qps.json
[
  {
    "num-prompts": 120,
    "request-rate": 1
  },{
    "num-prompts": 600,
    "request-rate": 5
  },{
    "num-prompts": 1200,
    "request-rate": 10
  },{
    "num-prompts": 1800,
    "request-rate": 15
  },{
    "num-prompts": 2400,
    "request-rate": 20
  },{
    "num-prompts": 1000,
    "request-rate": "inf"
  }
]

# sweep-qknorm-fusion.json
{
  "fused": {
    "compilation_config": {
      "pass_config": {
        "enable_qknorm_rope_fusion": true
      }
    }
  },
  "unfused": {
    "compilation_config": {
      "pass_config": {
        "enable_qknorm_rope_fusion": false
      }
    }
  }
}
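
JSON spells booleans as lowercase `true`/`false`, so one way to avoid hand-editing mistakes is to generate both parameter files from Python. The following is a minimal sketch under that assumption: the file names and values are taken from the command and snippets above, while the helper names (`build_qps_sweep`, `build_fusion_sweep`, `write_sweep_files`) are illustrative and not part of vLLM.

```python
import json
from pathlib import Path

# (num-prompts, request-rate) pairs copied from sweep-qps.json above.
QPS_POINTS = [(120, 1), (600, 5), (1200, 10), (1800, 15), (2400, 20), (1000, "inf")]


def build_qps_sweep():
    """Benchmark-side sweep: one entry per request rate."""
    return [{"num-prompts": n, "request-rate": r} for n, r in QPS_POINTS]


def build_fusion_sweep():
    """Server-side sweep: the same config with the fusion pass on and off."""
    return {
        name: {
            "compilation_config": {
                "pass_config": {"enable_qknorm_rope_fusion": enabled}
            }
        }
        for name, enabled in (("fused", True), ("unfused", False))
    }


def write_sweep_files(directory="."):
    """Hypothetical helper: write both files next to the benchmark command."""
    out = Path(directory)
    # json.dumps serializes Python True/False as JSON true/false.
    (out / "sweep-qps.json").write_text(json.dumps(build_qps_sweep(), indent=2))
    (out / "sweep-qknorm-fusion.json").write_text(
        json.dumps(build_fusion_sweep(), indent=2)
    )


if __name__ == "__main__":
    # Print instead of writing, so running the sketch has no side effects.
    print(json.dumps(build_fusion_sweep(), indent=2))
```

Calling `write_sweep_files()` in the directory where the sweep is launched would produce the two files referenced by `--bench-params` and `--serve-params`.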

Report of performance regression

No response

Misc discussion on performance

No response

Your current environment (if you think it is necessary)

The output of `python collect_env.py`

Before submitting a new issue...

Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.

Contributor guide

Research direction: Profile the fused QK norm and RoPE kernel on H100 and compare its performance with the unfused version. Identify the bottleneck (for example, memory access patterns or compute utilization) and consider whether the fusion needs optimization or should be disabled on some GPUs.
Tech stack: Python
Domain: AI, machine learning, performance
Issue type: Performance
Difficulty: 4
Estimated time: 3-5 days
Activity status: Active
Clarity: Clear
Prerequisites: Python, GPU computing concepts, vLLM internals
Beginner-friendly: 25
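
As a starting point for the fused-versus-unfused comparison described in the research direction, a small timing harness can help before reaching for a full profiler. This is only a sketch: `median_latency_ms` and `relative_slowdown` are hypothetical helpers, and the two callables in the example are CPU stand-ins, not the actual vLLM kernels. For real GPU kernels, each callable would need to block until the device work has finished (for example by calling `torch.cuda.synchronize()` inside it); otherwise only the launch overhead is measured.

```python
import statistics
import time


def median_latency_ms(fn, warmup=3, iters=20):
    """Median wall-clock latency of fn() in milliseconds."""
    # Warm-up runs absorb one-time costs such as compilation or caching.
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1e3)
    return statistics.median(samples)


def relative_slowdown(fused_ms, unfused_ms):
    """Positive when the fused path is slower (0.25 means 25% slower)."""
    return fused_ms / unfused_ms - 1.0


if __name__ == "__main__":
    # Stand-in workloads; replace with callables that run the fused and
    # unfused paths and wait for the GPU to finish.
    def fused_stand_in():
        return sum(i * i for i in range(20_000))

    def unfused_stand_in():
        return sum(i * i for i in range(10_000))

    fused = median_latency_ms(fused_stand_in)
    unfused = median_latency_ms(unfused_stand_in)
    print(f"fused={fused:.3f} ms unfused={unfused:.3f} ms "
          f"slowdown={relative_slowdown(fused, unfused):+.1%}")
```

The median is used rather than the mean so that a few outlier iterations do not dominate the comparison.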