lm-sys/FastChat

Issues with VLLM Integration Speedup

Open

#2,362 opened on Sep 5, 2023

good first issue

Description

Hello,

I've been working with the vLLM integration and I'm seeing a performance discrepancy. According to the documentation, I should get a significant speedup, but my tests show something different:

- Directly running with FastChat: 16 t/s
- Using the vLLM integration: 25 t/s (only about 1.5x speedup)
- vLLM offline inference: 90 t/s (roughly the 6x speedup I expected)

I'm running Vicuna 33B on an H100 GPU. Has anyone experienced this before? Are there any additional configurations or tweaks I might be missing to get the expected speedup?
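For reference, here is how the speedup figures relate to the throughput numbers above: each one is just that configuration's tokens per second divided by the plain FastChat baseline. This is only a minimal arithmetic sketch; the dictionary keys and the `speedup_over_baseline` helper are labels made up for illustration, not anything from FastChat or vLLM.

```python
# Throughput figures reported above, in tokens per second.
# The key names are illustrative labels, not FastChat or vLLM identifiers.
throughput_tps = {
    "fastchat_direct": 16.0,
    "fastchat_vllm_integration": 25.0,
    "vllm_offline": 90.0,
}


def speedup_over_baseline(results, baseline_key):
    """Return each configuration's throughput divided by the baseline's."""
    baseline = results[baseline_key]
    if baseline <= 0:
        raise ValueError("baseline throughput must be positive")
    return {name: tps / baseline for name, tps in results.items()}


speedups = speedup_over_baseline(throughput_tps, "fastchat_direct")
for name, ratio in speedups.items():
    print(f"{name}: {ratio:.2f}x")
```

So the integration comes out at roughly 1.6x over the baseline, while offline inference is roughly 5.6x, which is the gap I'm asking about.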

Thanks in advance for any guidance or advice!
