Issues with VLLM Integration Speedup · lm-sys/FastChat#2362

(4 comments) (0 reactions) (0 assignees) Python (4,736 forks)

good first issue

Repository metrics

Stars: 38,959
PR merge metric: (no PRs merged in the last 30 days)

Description

Hello,

I've been trying to work with the [vLLM integration] and I'm facing some performance discrepancies. According to the documentation, I should achieve a significant speedup, but in my tests, I'm seeing different results:

Directly running with FastChat: 16 t/s
Using the vLLM integration: 25 t/s (only 1.5x speedup)
vLLM offline inference: 90 t/s (expected 6x speedup)

I'm running Vicuna33b on an H100 GPU. Has anyone experienced this before? Are there any additional configurations or tweaks I might be missing to get the desired speedup?

Thanks in advance for any guidance or advice!
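
For reference, the speedups implied by the three throughput figures above can be checked with a few lines of Python. This is only a sketch of the arithmetic; the dictionary keys and the `speedup` helper are illustrative names, not anything from FastChat or vLLM.

```python
# Throughput figures reported in this issue, in tokens per second.
# The keys are illustrative labels, not FastChat or vLLM identifiers.
throughput = {
    "fastchat": 16.0,
    "vllm_integration": 25.0,
    "vllm_offline": 90.0,
}


def speedup(candidate: float, baseline: float) -> float:
    """Return how many times faster `candidate` is than `baseline`."""
    return candidate / baseline


baseline = throughput["fastchat"]
ratios = {name: speedup(tps, baseline) for name, tps in throughput.items()}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}x over plain FastChat")
```

The integration works out to roughly 1.56x and offline inference to roughly 5.6x over plain FastChat, which matches the rounded figures quoted in the report.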

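Contributor guide

Research direction: Compare the configurations used by the FastChat integration and by offline vLLM inference. Check whether FastChat uses a different batch size, model-loading parameters, or tokenizer settings that introduce overhead. Also confirm that both runs use the same model version and GPU setup.
Tech stack: Python
Area: backend
Issue type: bug
Difficulty: 2
Estimated time: 1-3 hours
Activity status: active
Clarity: mostly clear
Prerequisites: Python, vLLM
Beginner friendliness: 65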
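
One way to approach the comparison suggested in the research direction is to write down the settings each path actually ran with and diff them, and to time both paths with the same helper so the tokens-per-second figures are computed identically. The sketch below is a minimal, hypothetical harness: `diff_configs`, `measure_throughput`, and the example setting names and values are assumptions made for illustration, not FastChat or vLLM APIs, and the values are placeholders rather than the reporter's real configuration.

```python
import time
from typing import Any, Callable, Dict, List, Tuple


def diff_configs(a: Dict[str, Any], b: Dict[str, Any]) -> Dict[str, Tuple[Any, Any]]:
    """Return only the settings whose values differ between two runs."""
    missing = "<unset>"
    keys = sorted(set(a) | set(b))
    return {
        key: (a.get(key, missing), b.get(key, missing))
        for key in keys
        if a.get(key, missing) != b.get(key, missing)
    }


def measure_throughput(
    generate: Callable[[List[str]], List[int]], prompts: List[str]
) -> float:
    """Time one call to `generate` and return generated tokens per second.

    `generate` is a hypothetical wrapper around whichever path is being
    tested; it must return the number of generated tokens for each prompt.
    """
    start = time.perf_counter()
    token_counts = generate(prompts)
    elapsed = time.perf_counter() - start
    if elapsed <= 0:
        raise ValueError("elapsed time too small to measure")
    return sum(token_counts) / elapsed


# Placeholder settings for illustration only; fill in the real values
# used by each run before drawing any conclusion.
integration_run = {"dtype": "float16", "concurrent_requests": 1}
offline_run = {"dtype": "float16", "concurrent_requests": 32}

differences = diff_configs(integration_run, offline_run)
print(differences)
```

If the diff shows that one run processed many prompts at once while the other handled a single request, the two throughput numbers are not directly comparable, and the benchmark should be repeated with matching load before looking for other sources of overhead.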
