Issues with VLLM Integration Speedup · lm-sys/FastChat#2362

4 comments · 0 reactions · 0 assignees · Python · 4,736 forks

good first issue

Repository metrics

Stars: 38,959
PR merge metrics: No merged PRs in 30d

Description

Hello,

I've been trying to work with the [vLLM integration] and I'm facing some performance discrepancies. According to the documentation, I should achieve a significant speedup, but in my tests, I'm seeing different results:

- Directly running with FastChat: 16 t/s
- Using the vLLM integration: 25 t/s (only 1.5x speedup)
- vLLM offline inference: 90 t/s (expected 6x speedup)

I'm running Vicuna33b on a gin H100 GPU. Has anyone experienced this before? Are there any additional configurations or tweaks I might be missing to get the desired speedup?
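For reference, the speedup figures quoted above are plain ratios against the direct FastChat run. A minimal sketch of that arithmetic follows; the helper name `speedup` and the dictionary labels are illustrative and not part of FastChat or vLLM.

```python
def speedup(baseline_tps: float, candidate_tps: float) -> float:
    """Return how many times faster the candidate is than the baseline."""
    if baseline_tps <= 0:
        raise ValueError("baseline throughput must be positive")
    return candidate_tps / baseline_tps


# Throughput numbers reported in this issue, in tokens per second.
measurements = {
    "FastChat (direct)": 16.0,
    "FastChat + vLLM integration": 25.0,
    "vLLM offline inference": 90.0,
}

baseline = measurements["FastChat (direct)"]
for name, tps in measurements.items():
    # Print each configuration with its ratio to the direct FastChat run.
    print(f"{name}: {tps:g} t/s, {speedup(baseline, tps):.2f}x vs. baseline")
```

With these numbers the integration comes out at roughly 1.56x and offline inference at roughly 5.6x, which matches the rounded 1.5x and 6x figures above.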

Thanks in advance for any guidance or advice!

Contributor guide

Research direction: Compare the configuration used in FastChat integration vs offline vLLM inference. Check if FastChat is using different batch sizes, model loading parameters, or tokenizer settings that could cause overhead. Also, verify that the same model version and GPU settings are used.
Tech stack: python
Domain: backend
Issue type: Bug
Difficulty: 2
Estimated time: 1-3 hours
Activity status: Active
Clarity: Mostly clear
Prerequisites: Python, vLLM
Newbie friendliness: 65
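
One way to follow the research direction above is to time every configuration with the same harness, the same prompts, and the same batch size, so that any remaining gap can be attributed to the serving path rather than to the measurement. The sketch below is an assumption about how such a comparison could be set up, not code from FastChat or vLLM: `measure_throughput` and its `generate` callback are hypothetical names, and the callback is expected to return the number of generated tokens for each prompt.

```python
import time
from typing import Callable, List, Sequence


def measure_throughput(
    generate: Callable[[Sequence[str]], List[int]],
    prompts: Sequence[str],
    warmup: int = 1,
    clock: Callable[[], float] = time.perf_counter,
) -> float:
    """Return generated tokens per second for one batch of prompts.

    `generate` is a hypothetical adapter around the backend under test.
    It must run generation for all prompts and return the number of
    generated tokens per prompt.
    """
    # Warm-up runs keep one-time costs (lazy initialization, caches)
    # out of the timed measurement.
    for _ in range(warmup):
        generate(prompts)

    start = clock()
    token_counts = generate(prompts)
    elapsed = clock() - start
    if elapsed <= 0:
        raise ValueError("elapsed time must be positive")
    return sum(token_counts) / elapsed


if __name__ == "__main__":
    # Stand-in backend for demonstration only: sleeps briefly and
    # pretends to emit 10 tokens per prompt.
    def fake_backend(batch: Sequence[str]) -> List[int]:
        time.sleep(0.01)
        return [10 for _ in batch]

    print(f"{measure_throughput(fake_backend, ['a', 'b', 'c']):.1f} t/s")
```

For the offline vLLM case, the adapter could wrap `LLM.generate(prompts, sampling_params)` and count `len(output.outputs[0].token_ids)` for each returned request. For the FastChat paths it would wrap whatever client call is being benchmarked. Note that offline inference processes all prompts as one batch, so a single sequential request stream against a server is not a like-for-like comparison unless the requests are sent concurrently.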
