Issues with VLLM Integration Speedup · lm-sys/FastChat#2362

4 comments · 0 reactions · 0 assignees · Python · 4,736 forks · batch import

good first issue

Repository metrics

Stars: 38,959
PR merge metrics: no PRs merged in the last 30 days

Description

Hello,

I've been trying to work with the [vLLM integration] and I'm facing some performance discrepancies. According to the documentation, I should achieve a significant speedup, but in my tests, I'm seeing different results:

Directly running with FastChat: 16 t/s
Using the vLLM integration: 25 t/s (only 1.5x speedup)
vLLM offline inference: 90 t/s (expected 6x speedup)

I'm running Vicuna-33B on a gin H100 GPU. Has anyone experienced this before? Are there any additional configurations or tweaks I might be missing to get the desired speedup?

Thanks in advance for any guidance or advice!
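
For reference, the speedups implied by the figures reported above can be checked with a quick calculation. This is only a sketch over the three numbers quoted in the post; the dictionary keys and the `speedups` helper are illustrative names, not part of FastChat or vLLM.

```python
# Throughput figures quoted in the issue, in tokens per second.
reported_tps = {
    "FastChat (direct)": 16.0,
    "FastChat + vLLM integration": 25.0,
    "vLLM offline inference": 90.0,
}


def speedups(tps, baseline):
    """Return each configuration's throughput relative to the baseline entry."""
    base = tps[baseline]
    return {name: value / base for name, value in tps.items()}


ratios = speedups(reported_tps, "FastChat (direct)")
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}x")

# Remaining gap between the integration and offline inference.
gap = reported_tps["vLLM offline inference"] / reported_tps["FastChat + vLLM integration"]
print(f"offline vs. integration: {gap:.1f}x")
```

The integration comes out at roughly 1.56x over plain FastChat and offline inference at roughly 5.6x, so the unexplained gap between the two vLLM paths is about 3.6x.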

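Contributor guide

Research direction: Compare the configurations used by the FastChat integration and by offline vLLM inference. Check whether FastChat uses a different batch size, different model-loading parameters, or different tokenizer settings that add overhead. Also confirm that both runs use the same model version and GPU setup.
Tech stack: Python
Domain: backend
Issue type: bug
Difficulty: 2
Estimated time: 1-3 hours
Activity: active
Clarity: mostly clear
Prerequisites: Python, vLLM
Beginner friendliness: 65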
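
One way to act on the research direction above is to time both paths with the same prompts and the same batch size, since a one-request-at-a-time test and a batched offline run are not directly comparable. The following is a minimal sketch, not FastChat or vLLM code: `measure_tps` and the `generate` adapter are hypothetical names, and `fake_generate` is a stand-in backend with a fixed per-call overhead so the example runs without a GPU. In a real comparison, `generate` would wrap either the FastChat worker or the offline vLLM call and return the number of tokens generated per prompt.

```python
import time
from typing import Callable, Sequence


def measure_tps(generate: Callable[[Sequence[str]], Sequence[int]],
                prompts: Sequence[str],
                batch_size: int) -> float:
    """Return generated tokens per second over all prompts.

    `generate` is a hypothetical adapter: it takes a batch of prompts and
    returns the number of tokens generated for each prompt in that batch.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")
    total_tokens = 0
    start = time.perf_counter()
    for i in range(0, len(prompts), batch_size):
        total_tokens += sum(generate(prompts[i:i + batch_size]))
    elapsed = time.perf_counter() - start
    return total_tokens / elapsed if elapsed > 0 else float("inf")


def fake_generate(batch: Sequence[str]) -> Sequence[int]:
    # Stand-in backend (assumption): 10 ms overhead per call, 8 tokens per prompt.
    time.sleep(0.01)
    return [8] * len(batch)


prompts = ["hello"] * 16
tps_single = measure_tps(fake_generate, prompts, batch_size=1)
tps_batched = measure_tps(fake_generate, prompts, batch_size=16)
print(f"batch_size=1: {tps_single:.0f} t/s, batch_size=16: {tps_batched:.0f} t/s")
```

With the stand-in backend, the batched run reports much higher throughput purely because the per-call overhead is paid once instead of sixteen times. If the two real measurements in the issue used different batch sizes or concurrency, part of the gap could come from the same effect.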
