Issues with VLLM Integration Speedup · lm-sys/FastChat#2362

4 comments · 0 reactions · 0 assignees · Python · 4,736 forks

good first issue

Repository metrics

Stars: 38,959
PR merge metrics: no PRs merged in the last 30 days

Description

Hello,

I've been trying to work with the [vLLM integration] and I'm facing some performance discrepancies. According to the documentation, I should achieve a significant speedup, but in my tests, I'm seeing different results:

- Directly running with FastChat: 16 t/s
- Using the vLLM integration: 25 t/s (only 1.5x speedup)
- vLLM offline inference: 90 t/s (expected 6x speedup)

I'm running Vicuna-33B on a gin H100 GPU. Has anyone experienced this before? Are there any additional configurations or tweaks I might be missing to get the desired speedup?

Thanks in advance for any guidance or advice!
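
For anyone reproducing these numbers, the sketch below shows one way to measure tokens per second the same way for every backend, then recomputes the speedup ratios from the figures reported above. It is a minimal sketch, not FastChat's or vLLM's own benchmark: `tokens_per_second` and `speedup` are hypothetical helpers, and `generate` is assumed to be any callable you supply that takes a list of prompts and returns one list of generated token IDs per prompt. Aggregate throughput over a batch of prompts and single-request throughput are different quantities, so it may be worth checking that all three figures were measured with the same number of concurrent prompts.

```python
import time
from typing import Callable, Dict, Sequence


def tokens_per_second(
    generate: Callable[[Sequence[str]], Sequence[Sequence[int]]],
    prompts: Sequence[str],
) -> float:
    """Time a single call to `generate` and return generated tokens per second.

    `generate` is an assumption: wrap whichever backend you are testing so that
    it returns the generated token IDs for each prompt.
    """
    start = time.perf_counter()
    outputs = generate(prompts)
    elapsed = time.perf_counter() - start
    if elapsed <= 0:
        raise ValueError("elapsed time too small to measure")
    total_tokens = sum(len(token_ids) for token_ids in outputs)
    return total_tokens / elapsed


def speedup(baseline_tps: float, candidate_tps: float) -> float:
    """Return how many times faster the candidate is than the baseline."""
    return candidate_tps / baseline_tps


# Throughput figures (tokens/s) as reported in the issue.
reported: Dict[str, float] = {
    "fastchat": 16.0,
    "fastchat_vllm": 25.0,
    "vllm_offline": 90.0,
}

speedups = {
    name: speedup(reported["fastchat"], tps) for name, tps in reported.items()
}

for name, ratio in speedups.items():
    print(f"{name}: {ratio:.2f}x over plain FastChat")
```

Passing the same prompt list and the same generation length to each backend keeps the comparison like for like.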

Contributor guide

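Investigation approach: Compare the settings used by the FastChat integration with those used for offline vLLM inference. Check whether FastChat uses a different batch size, different model-loading parameters, or different tokenizer settings that introduce overhead. Also confirm that both runs use the same model version and GPU configuration.
Tech stack: python
Area: backend
Issue type: bug
Difficulty: 2
Estimated time: 1-3 hours
Activity: active
Clarity: mostly clear
Prerequisites: Python, vLLM
Beginner friendliness: 65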
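
The settings comparison suggested in the investigation approach can be sketched as a simple dictionary diff. This is only an illustration: `diff_settings` is a hypothetical helper, and every key and value below is a placeholder rather than a real FastChat or vLLM option name or a setting taken from the issue. Fill the two dictionaries with whatever you actually passed to each run.

```python
from typing import Any, Dict, Tuple

UNSET = "<unset>"


def diff_settings(
    a: Dict[str, Any], b: Dict[str, Any]
) -> Dict[str, Tuple[Any, Any]]:
    """Return {key: (value_in_a, value_in_b)} for every key whose values differ.

    A key present on only one side is reported with UNSET on the other side.
    """
    differences = {}
    for key in sorted(set(a) | set(b)):
        left = a.get(key, UNSET)
        right = b.get(key, UNSET)
        if left != right:
            differences[key] = (left, right)
    return differences


# Placeholder values for illustration only; not taken from the issue.
integration_run = {
    "model": "vicuna-33b",
    "dtype": "float16",
    "concurrent_prompts": 1,
    "max_new_tokens": 256,
}
offline_run = {
    "model": "vicuna-33b",
    "dtype": "float16",
    "concurrent_prompts": 32,
    "max_new_tokens": 256,
}

differences = diff_settings(integration_run, offline_run)
for key, (integration_value, offline_value) in differences.items():
    print(f"{key}: integration={integration_value!r}, offline={offline_value!r}")
```

Any key that shows up in the diff is a candidate explanation for the throughput gap and is worth aligning before re-running the benchmark.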
