lm-sys/FastChat

Issues with VLLM Integration Speedup

Open

#2362, created on September 5, 2023

good first issue

Description

Hello,

I've been trying out the vLLM integration and I'm seeing a performance discrepancy. According to the documentation, I should get a significant speedup, but my tests show something different:

- Running directly with FastChat: 16 t/s
- Using the vLLM integration: 25 t/s (only about a 1.5x speedup)
- vLLM offline inference: 90 t/s (roughly the expected 6x speedup)

I'm running Vicuna-33B on a single H100 GPU. Has anyone experienced this before? Are there any additional configurations or tweaks I might be missing to get the desired speedup?
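For reference, here is a minimal sketch of how the speedup ratios follow from the three throughput figures reported above. The dictionary labels and the `speedups` helper are hypothetical names used only for illustration; the numbers are the ones from my tests, with direct FastChat serving as the baseline.

```python
# Throughput figures reported above, in tokens per second (t/s).
# The labels are illustrative; only the numbers come from the tests.
throughput_tps = {
    "FastChat (direct)": 16.0,
    "FastChat + vLLM integration": 25.0,
    "vLLM offline inference": 90.0,
}


def speedups(results, baseline_key):
    """Return each configuration's throughput relative to the baseline."""
    baseline = results[baseline_key]
    return {name: tps / baseline for name, tps in results.items()}


ratios = speedups(throughput_tps, "FastChat (direct)")
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}x over baseline")
```

The integration comes out at about 1.56x and offline inference at about 5.6x, so the gap I'm asking about is between the served path and the offline path, not vLLM itself.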

Thanks in advance for any guidance or advice!
