[Performance]: qknorm+rope fusion slower than unfused on H100 · vllm-project/vllm#34391

12 comments · 1 reaction · 1 assignee · Python · 16,816 forks

Labels: help wanted, performance, torch.compile

Repository metrics

Stars: 80,034
PR merge metrics: average time to merge 3 days 17 hours; 993 PRs merged in the last 30 days

Description

Proposal to improve performance

Running vllm bench sweep serve for -cc.pass_config.enable_qknorm_rope_fusion in {True, False} gives the following results:

# base command
vllm bench sweep serve \
    --serve-cmd "vllm serve --model $MODEL --tensor-parallel-size $TP --port $PORT --no-enable-prefix-caching" \
    --bench-cmd "vllm bench serve --dataset-name random --ignore-eos --model=$MODEL --port $PORT" \
    --bench-params sweep-qps.json \
    --serve-params sweep-qknorm-fusion.json

# sweep-qps.json
[
  {
    "num-prompts": 120,
    "request-rate": 1
  },{
    "num-prompts": 600,
    "request-rate": 5
  },{
    "num-prompts": 1200,
    "request-rate": 10
  },{
    "num-prompts": 1800,
    "request-rate": 15
  },{
    "num-prompts": 2400,
    "request-rate": 20
  },{
    "num-prompts": 1000,
    "request-rate": "inf"
  }
]

# sweep-qknorm-fusion.json
{
  "fused": {
    "compilation_config": {
      "pass_config": {
        "enable_qknorm_rope_fusion": true
      }
    }
  },
  "unfused": {
    "compilation_config": {
      "pass_config": {
        "enable_qknorm_rope_fusion": false
      }
    }
  }
}
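
JSON spells booleans in lowercase (`true`/`false`), which is easy to get wrong when writing sweep files by hand. A minimal sketch that generates both files from Python dicts and lets the `json` module handle the serialization is shown below. The helper names (`build_qps_sweep`, `build_fusion_sweep`, `write_sweep_files`) are hypothetical and not part of vLLM; the parameter values are copied from the sweep files above.

```python
import json
from pathlib import Path

# (num-prompts, request-rate) pairs copied from sweep-qps.json above.
QPS_POINTS = [(120, 1), (600, 5), (1200, 10), (1800, 15), (2400, 20), (1000, "inf")]


def build_qps_sweep(points):
    """Return the list of benchmark parameter sets for sweep-qps.json."""
    return [{"num-prompts": n, "request-rate": r} for n, r in points]


def build_fusion_sweep():
    """Return the fused/unfused serve parameter sets for sweep-qknorm-fusion.json."""
    def variant(enabled):
        return {
            "compilation_config": {
                "pass_config": {"enable_qknorm_rope_fusion": enabled}
            }
        }
    return {"fused": variant(True), "unfused": variant(False)}


def write_sweep_files(out_dir="."):
    """Write both sweep files; json.dumps emits Python booleans as true/false."""
    out = Path(out_dir)
    qps_text = json.dumps(build_qps_sweep(QPS_POINTS), indent=2) + "\n"
    fusion_text = json.dumps(build_fusion_sweep(), indent=2) + "\n"
    (out / "sweep-qps.json").write_text(qps_text)
    (out / "sweep-qknorm-fusion.json").write_text(fusion_text)


if __name__ == "__main__":
    write_sweep_files()
```

Running the script in the working directory writes both files next to the base command, so the `--bench-params` and `--serve-params` paths resolve unchanged.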

Report of performance regression

No response

Misc discussion on performance

No response

Your current environment (if you think it is necessary)

The output of `python collect_env.py`

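Contributor guide

Research direction: Profile the fused QK-norm and RoPE kernel on H100 and compare it with the unfused version. Identify bottlenecks (e.g. memory access patterns, compute utilization), and consider whether the fusion needs to be tuned or disabled for certain GPUs.
Tech stack: python
Domain: ai, machine learning, performance
Issue type: performance
Difficulty: 4
Estimated time: 3-5 days
Activity status: active
Clarity: clear
Prerequisites: Python, GPU computing concepts, vllm internals
Beginner friendliness: 25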
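
For anyone picking this up, it may help to have a plain reference for what the two unfused steps compute, for example to check a kernel's numerics before profiling it. The sketch below is pure Python and assumes RMSNorm as the Q/K normalization and NeoX-style pairing for the rotation. Both are assumptions for illustration, and the function names are hypothetical rather than vLLM APIs.

```python
import math


def rms_norm(x, weight, eps=1e-6):
    """RMSNorm over one head vector: x / sqrt(mean(x^2) + eps) * weight."""
    mean_sq = sum(v * v for v in x) / len(x)
    scale = 1.0 / math.sqrt(mean_sq + eps)
    return [v * scale * w for v, w in zip(x, weight)]


def rope_neox(x, position, base=10000.0):
    """Rotary embedding, NeoX-style: element i is rotated with element i + d/2."""
    dim = len(x)
    half = dim // 2
    out = [0.0] * dim
    for i in range(half):
        # Rotation angle for pair i: position * base^(-2i/d).
        angle = position * base ** (-2.0 * i / dim)
        c, s = math.cos(angle), math.sin(angle)
        out[i] = x[i] * c - x[i + half] * s
        out[i + half] = x[i + half] * c + x[i] * s
    return out


def qk_norm_rope_unfused(q_heads, k_heads, q_weight, k_weight, position):
    """Unfused path: normalize each head, then rotate it, as two separate passes."""
    q = [rope_neox(rms_norm(h, q_weight), position) for h in q_heads]
    k = [rope_neox(rms_norm(h, k_weight), position) for h in k_heads]
    return q, k
```

In the unfused path each head vector is read and written twice (once per step), which is the memory traffic a fused kernel is meant to save. Whether that saving shows up on a given GPU is what the profiling would need to establish.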