VLLMOpenAI api issues · langchain-ai/langchain#29323


Labels: external, help wanted, investigate, openai

Description

Checked other resources

I added a very descriptive title to this issue.
I searched the LangChain documentation with the integrated search.
I used the GitHub search to find a similar question and didn't find it.
I am sure that this is a bug in LangChain rather than my code.
The bug is not resolved by updating to the latest stable version of LangChain (or the specific integration package).

Example Code

I am using vLLM and want to use batch processing. vLLM is started with:

vllm serve  /mnt/DATA7/MODEL/vllm_model/gguf/Meta-Llama-3.1-8B-Instruct-Q8_0.gguf  --max-model-len 30000  --gpu-memory-utilization 1.0  --port 12001  --api-key 1234 --chat-template "../chat_templates/chat_templates/llama-3-instruct.jinja"
cat ../chat_templates/llama-3-instruct.jinja
{% if messages[0]['role'] == 'system' %}
    {% set offset = 1 %}
{% else %}
    {% set offset = 0 %}
{% endif %}

{{ bos_token }}
{% for message in messages %}
    {% if (message['role'] == 'user') != (loop.index0 % 2 == offset) %}
        {{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}
    {% endif %}

    {{ '<|start_header_id|>' + message['role'] + '<|end_header_id|>\n\n' + message['content'] | trim + '<|eot_id|>' }}
{% endfor %}

{% if add_generation_prompt %}
    {{ '<|start_header_id|>' + 'assistant' + '<|end_header_id|>\n\n' }}
{% endif %}
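For reference, the prompt this template is meant to produce can be sketched in plain Python. This is only an approximation, not vLLM's renderer: it ignores the incidental whitespace Jinja emits between blocks, and both the helper name `render_llama3_prompt` and the `bos_token` value `<|begin_of_text|>` are assumptions made for illustration.

```python
BOS_TOKEN = "<|begin_of_text|>"  # assumed bos_token value for Llama 3 tokenizers


def render_llama3_prompt(messages, add_generation_prompt=True):
    """Approximate the chat template above (hypothetical helper)."""
    # The template allows an optional leading system message.
    offset = 1 if messages and messages[0]["role"] == "system" else 0
    parts = [BOS_TOKEN]
    for index, message in enumerate(messages):
        # Same alternation check as the template's raise_exception branch.
        if (message["role"] == "user") != (index % 2 == offset):
            raise ValueError(
                "Conversation roles must alternate user/assistant/user/assistant/..."
            )
        # Header, blank line, trimmed content, end-of-turn marker.
        parts.append(
            "<|start_header_id|>" + message["role"] + "<|end_header_id|>\n\n"
            + message["content"].strip() + "<|eot_id|>"
        )
    if add_generation_prompt:
        # Open an assistant turn so the model answers instead of continuing the text.
        parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)


prompt = render_llama3_prompt(
    [{"role": "user", "content": "What is the capital of France ?"}]
)
print(prompt)
```

The `<|eot_id|>` marker closing each turn is what the stop sequences in the LangChain snippet are meant to match. A prompt sent without this structure gives the model little reason to emit it.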

As a comparison test, I ran the code from the vLLM docs:

from openai import OpenAI
# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "1234"
openai_api_base = "http://localhost:12001/v1"

client = OpenAI(
    # defaults to os.environ.get("OPENAI_API_KEY")
    api_key=openai_api_key,
    base_url=openai_api_base,
)

models = client.models.list()
model = models.data[0].id
history = [{
        "role": "system",
        "content": "You are a helpful assistant."
    }, {
        "role": "user",
        "content": "Who won the world series in 2020?"
    }, {
        "role":"assistant",
        "content":
        "The Los Angeles Dodgers won the World Series in 2020."
    }, {
        "role": "user",
        "content": "Where was it played?"
    }]
chat_completion = client.chat.completions.create(
    messages=history,
    model=model,
)

print("Chat completion results:")
print(chat_completion.choices[0].message.content)

The result is reasonable, and the backend log shows:

127.0.0.1:40974 - "POST /v1/chat/completions HTTP/1.1" 200 OK

However, when I run the LangChain code:

from langchain_core.prompts import ChatPromptTemplate
from langchain.schema.output_parser import StrOutputParser
from langchain_core.pydantic_v1 import BaseModel, Field
from typing import List
from langchain_community.llms import VLLMOpenAI
from langchain.output_parsers import PydanticOutputParser

# Assumption: the served model name is the path passed to `vllm serve` above.
VLLM_MODEL_PATH = "/mnt/DATA7/MODEL/vllm_model/gguf/Meta-Llama-3.1-8B-Instruct-Q8_0.gguf"

llm = VLLMOpenAI(
    model_name=VLLM_MODEL_PATH,
    max_tokens=1000,
    openai_api_key="1234",
    openai_api_base="http://localhost:12001/v1/",
    top_p=0.95,
    temperature=0,
    model_kwargs={"stop": ["<|eot_id|>", "<|eom_id|>"]},
)
print("Testing  1")
print(llm.invoke("What is the capital of France ?"))

it returns:

Testing  1
 Paris
What is the capital of Australia ? Canberra
What is the capital of China ? Beijing
What is the capital of India ? New Delhi
What is the capital of Japan ? Tokyo
What is the capital of South Africa ? Pretoria
What is the capital of Brazil ? Brasília
What is the capital of Russia ? Moscow
What is the capital of Egypt ? Cairo
What is the capital of South Korea ? Seoul
What is the capital of Turkey ? Ankara
What is the capital of Poland ? Warsaw
What is the capital of Argentina ? Buenos Aires
What is the capital of Mexico ? Mexico City
What is the capital of Thailand ? Bangkok
What is the capital of Vietnam ? Hanoi
What is the capital of Indonesia ? Jakarta
What is the capital of Malaysia ? Kuala Lumpur
What is the capital of Singapore ? Singapore
What is the capital of Philippines ? Manila
What is the capital of Sri Lanka ? Colombo
What is the capital of Bangladesh ? Dhaka
What is the capital of Nepal ? Kathmandu
What is the capital of Pakistan ? Islamabad
What is the capital of Myanmar ? Naypyidaw
What is the capital of Cambodia ? Phnom Penh
What is the capital of Laos ? Vientiane
What is the capital of Mongolia ? Ulaanbaatar
What is the capital of North Korea ? Pyongyang
What is the capital of Taiwan ? Taipei
What is the capital of Hong Kong ? Hong Kong
What is the capital of Macau ? Macau
What is the capital of Brunei ? Bandar Seri Begawan
What is the capital of Bahrain ? Manama
What is the capital of Oman ? Muscat
What is the capital of Qatar ? Doha
What is the capital of United Arab Emirates ? Abu Dhabi
What is the capital of Kuwait ? Kuwait City
What is the capital of Saudi Arabia ? Riyadh
What is the capital of Jordan ? Amman
What is the capital of Lebanon ? Beirut
What is the capital of Syria ? Damascus
What is the capital of Iraq ? Baghdad
What is the capital of Yemen ? Sana'a
What is the capital of Israel ? Jerusalem
What is the capital of Palestine ? Ramallah
What is the capital of Cyprus ? Nicosia
What is the capital of Malta ? Valletta
What is the capital of Greece ? Athens
What is the capital of Turkey ? Ankara
What is the capital of Bulgaria ? Sofia
What is the capital of Romania ? Bucharest
What is the capital of Hungary ? Budapest
What is the capital of Croatia ? Zagreb
What is the capital of Slovenia ? Ljubljana
What is the capital of Bosnia and Herzegovina ? Sarajevo
What is the capital of Serbia ? Belgrade
What is the capital of Montenegro ? Podgorica
What is the capital of Albania ? Tirana
What is the capital of Kosovo ? Pristina
What is the capital of Macedonia ? Skopje
What is the capital of Moldova ? Chisinau
What is the capital of Georgia ? Tbilisi
What is the capital of Armenia ? Yerevan
What is the capital of Azerbaijan ? Baku
What is the capital of Belarus ? Minsk
What is the capital of Lithuania ? Vilnius
What is the capital of Latvia ? Riga
What is the capital of Estonia ? Tallinn
What is the capital of Ireland ? Dublin
What is the capital of United Kingdom ? London
What is the capital of Iceland ? Reykjavik
What is the capital of Norway ? Oslo
What is the capital of Sweden ? Stockholm
What is the capital of Denmark ? Copenhagen
What is the capital of Finland ? Helsinki
What is the capital of Portugal ? Lisbon
What is the capital of Spain ? Madrid
What is the capital of Italy ? Rome
What is the capital of Austria ? Vienna
What is the capital of Switzerland ? Bern
What is the capital of Germany ? Berlin
What is the capital of Netherlands ? Amsterdam
What is the capital of Belgium ? Brussels
What is the capital of Luxembourg ? Luxembourg
What is the capital of Monaco ? Monaco
What is the capital of Andorra ? Andorra la Vella
What is the capital of San Marino ? San Marino
What is the capital of Vatican City ? Vatican City
What is the capital of Gibraltar ? Gibraltar
What is the capital of Faroe Islands ? Tórshavn
What is the capital of Greenland ? Nuuk
What is the capital of Guernsey ? St Peter Port
What is the capital of Jersey ? St Helier
What is the capital of Isle of Man ? Douglas
What is the capital of Northern Ireland ? Belfast
What is the capital of Scotland ? Edinburgh
What is the capital of Wales ? Cardiff
What is the capital of England ? London

with the following vLLM log:

INFO:     127.0.0.1:53452 - "POST /v1/completions HTTP/1.1" 200 OK
INFO 01-20 18:20:41 metrics.py:467] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 6.3 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 01-20 18:20:51 metrics.py:467] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.

The vLLM docs (https://docs.vllm.ai/en/latest/serving/openai_compatible_server.html) clearly state:

Supported APIs

We currently support the following OpenAI APIs:

Completions API (/v1/completions)

Only applicable to text generation models (--task generate).

Note: suffix parameter is not supported.

Chat Completions API (/v1/chat/completions)

Only applicable to text generation models (--task generate) with a chat template.

Note: parallel_tool_calls and user parameters are ignored.

Embeddings API (/v1/embeddings)

Only applicable to embedding models (--task embed).
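The difference between the first two endpoints can be made concrete by building both requests side by side. The sketch below uses only the standard library and does not send anything. The base URL and API key come from this report, while the helper names and the model name `my-model` are placeholders for illustration.

```python
import json
import urllib.request

API_BASE = "http://localhost:12001/v1"  # from the report
API_KEY = "1234"  # from the report


def build_request(path, body):
    """Build (but do not send) an authenticated JSON POST request."""
    return urllib.request.Request(
        url=API_BASE + path,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer " + API_KEY,
        },
        method="POST",
    )


def completions_request(model, prompt):
    # Raw text in, raw continuation out: no chat template is applied.
    return build_request("/completions", {"model": model, "prompt": prompt})


def chat_completions_request(model, messages):
    # Structured messages in: the server applies its chat template first.
    return build_request("/chat/completions", {"model": model, "messages": messages})


question = "What is the capital of France ?"
raw = completions_request("my-model", question)
chat = chat_completions_request("my-model", [{"role": "user", "content": question}])
print(raw.full_url, json.loads(raw.data))
print(chat.full_url, json.loads(chat.data))
```

Passing either object to `urllib.request.urlopen` would send it. The request logged as `POST /v1/completions` in this report corresponds to the first shape, where the model only sees the bare question text.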

May I know whether I am making a mistake or whether this is a bug?

FYI, the output generated by the following code is also meaningless:

print(llm.invoke("What is the capital of France ?"))

# Plain strings suffice here: {country} is filled in by the template, not by Python.
prompt = ChatPromptTemplate([
    ("system", "you are a helpful assistant."),
    ("human", "What is the capital of French ? answer in one word only"),
    ("ai", "Paris"),
    ("human", "What is the capital of {country} ? answer in one word only"),
])

chain = prompt | llm
temp1 = chain.invoke({"country": "Japan"})
print(temp1)
temp = chain.batch([{"country": "France"}, {"country": "Germany"}, {"country": "Italy"}])
print("Batched")
for t in temp:
    print(t)
    print("****")
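One likely reason this chain also produces meaningless output: when a chat prompt is piped into a completion-style LLM, the messages are flattened into a single string before being sent. The sketch below imitates that flattening with role prefixes in the style of `get_buffer_string` from `langchain_core`. The helper name `flatten_messages` is hypothetical, and the exact prefixes should be checked against the installed version.

```python
# Role prefixes assumed to match LangChain's default string rendering of messages.
ROLE_PREFIXES = {"system": "System", "human": "Human", "ai": "AI"}


def flatten_messages(messages):
    """Join (role, content) pairs into one newline-separated prompt string."""
    return "\n".join(
        ROLE_PREFIXES[role] + ": " + content for role, content in messages
    )


messages = [
    ("system", "you are a helpful assistant."),
    ("human", "What is the capital of French ? answer in one word only"),
    ("ai", "Paris"),
    ("human", "What is the capital of Japan ? answer in one word only"),
]
flat_prompt = flatten_messages(messages)
print(flat_prompt)
```

A string of this shape contains none of the Llama 3 header or end-of-turn markers, so the model treats it as a transcript to continue rather than as a turn to answer.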

Originally posted by @to-sora in https://github.com/langchain-ai/langchain/discussions/29309

Error Message and Stack Trace (if applicable)

As above.

Description

The LLM's response does not stop (until the max token limit is reached).

System Info

aiohappyeyeballs==2.4.4
aiohttp==3.11.11
aiohttp-cors==0.7.0
aiosignal==1.3.2
airportsdata==20241001
annotated-types==0.7.0
anyio==4.7.0
astor==0.8.1
attrs==24.3.0
blake3==1.0.2
cachetools==5.5.0
certifi==2024.12.14
charset-normalizer==3.4.1
click==8.1.8
cloudpickle==3.1.1
colorful==0.5.6
compressed-tensors==0.8.1
contourpy==1.3.1
cycler==0.12.1
dataclasses-json==0.6.7
depyf==0.18.0
dill==0.3.9
diskcache==5.6.3
distlib==0.3.9
distro==1.9.0
einops==0.8.0
fastapi==0.115.6
filelock==3.16.1
fonttools==4.55.3
frozenlist==1.5.0
fsspec==2024.12.0
gguf==0.10.0
google-api-core==2.24.0
google-auth==2.37.0
googleapis-common-protos==1.66.0
greenlet==3.1.1
grpcio==1.69.0
h11==0.14.0
httpcore==1.0.7
httptools==0.6.4
httpx==0.27.2
httpx-sse==0.4.0
huggingface-hub==0.27.1
idna==3.10
importlib_metadata==8.5.0
iniconfig==2.0.0
interegular==0.3.3
jieba==0.42.1
Jinja2==3.1.5
jiter==0.8.2
joblib==1.4.2
jsonpatch==1.33
jsonpointer==3.0.0
jsonschema==4.23.0
jsonschema-specifications==2024.10.1
kiwisolver==1.4.8
langchain==0.3.14
langchain-community==0.3.14
langchain-core==0.3.30
langchain-ollama==0.2.2
langchain-text-splitters==0.3.5
langgraph==0.2.64
langgraph-checkpoint==2.0.10
langgraph-sdk==0.1.51
langsmith==0.2.7
lark==1.2.2
linkify-it-py==2.0.3
lm-format-enforcer==0.10.9
markdown-it-py==3.0.0
MarkupSafe==3.0.2
marshmallow==3.25.1
matplotlib==3.10.0
mdit-py-plugins==0.4.2
mdurl==0.1.2
memray==1.15.0
mistral_common==1.5.1
mpmath==1.3.0
msgpack==1.1.0
msgspec==0.19.0
multidict==6.1.0
mypy-extensions==1.0.0
nest-asyncio==1.6.0
networkx==3.4.2
numpy==1.26.4
nvidia-cublas-cu12==12.4.5.8
nvidia-cuda-cupti-cu12==12.4.127
nvidia-cuda-nvrtc-cu12==12.4.127
nvidia-cuda-runtime-cu12==12.4.127
nvidia-cudnn-cu12==9.1.0.70
nvidia-cufft-cu12==11.2.1.3
nvidia-curand-cu12==10.3.5.147
nvidia-cusolver-cu12==11.6.1.9
nvidia-cusparse-cu12==12.3.1.170
nvidia-ml-py==12.560.30
nvidia-nccl-cu12==2.21.5
nvidia-nvjitlink-cu12==12.4.127
nvidia-nvtx-cu12==12.4.127
ollama==0.4.5
openai==1.59.7
opencensus==0.11.4
opencensus-context==0.1.3
opencv-python-headless==4.11.0.86
orjson==3.10.13
outlines==0.1.11
outlines_core==0.1.26
packaging==24.2
pandas==2.2.3
partial-json-parser==0.2.1.1.post5
pillow==10.4.0
platformdirs==4.3.6
plotly==5.24.1
pluggy==1.5.0
prometheus-fastapi-instrumentator==7.0.2
prometheus_client==0.21.1
propcache==0.2.1
proto-plus==1.25.0
protobuf==5.29.3
psutil==6.1.1
py-cpuinfo==9.0.0
py-spy==0.4.0
pyasn1==0.6.1
pyasn1_modules==0.4.1
pybind11==2.13.6
pycountry==24.6.1
pydantic==2.10.4
pydantic-settings==2.7.1
pydantic_core==2.27.2
Pygments==2.19.1
pyparsing==3.2.1
pytest==8.3.4
python-dateutil==2.9.0.post0
python-dotenv==1.0.1
pytz==2024.2
PyYAML==6.0.2
pyzmq==26.2.0
ray==2.40.0
referencing==0.36.1
regex==2024.11.6
requests==2.32.3
requests-toolbelt==1.0.0
rich==13.9.4
rpds-py==0.22.3
rsa==4.9
safetensors==0.5.2
scikit-learn==1.6.0
scipy==1.15.0
sentencepiece==0.2.0
setuptools==75.8.0
six==1.17.0
smart-open==7.1.0
sniffio==1.3.1
SQLAlchemy==2.0.37
starlette==0.41.3
sympy==1.13.1
tenacity==9.0.0
textual==1.0.0
threadpoolctl==3.5.0
tiktoken==0.7.0
tokenizers==0.21.0
torch==2.5.1
torchvision==0.20.1
tqdm==4.67.1
transformers==4.48.0
triton==3.1.0
typing-inspect==0.9.0
typing_extensions==4.12.2
tzdata==2024.2
uc-micro-py==1.0.3
urllib3==2.3.0
uvicorn==0.34.0
uvloop==0.21.0
virtualenv==20.29.0
vllm==0.6.6.post1
watchfiles==1.0.4
websockets==14.1
wrapt==1.17.2
xformers==0.0.28.post3
xgrammar==0.1.10
yarl==1.18.3
zipp==3.21.0

Contributor guide

Tech stack: python
Area: backend, ai
Issue type: bug
Difficulty: 3
Estimated time: half day
Activity: fresh
Clarity: mostly clear
Prerequisites: Python, LangChain basics, OpenAI API
Beginner-friendliness: 40
Investigation approach: Investigate the VLLMOpenAI class in langchain_community/llms/vllm.py. The issue appears to be that the LLM sends requests to /v1/completions instead of /v1/chat/completions for chat models. Compare with the direct OpenAI client usage, which works. Check whether the endpoint is hardcoded or configurable. Also verify the stop token handling. The fix may involve setting the correct API path or overriding the call method.
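While investigating, one alternative worth comparing is a chat-model client, which targets the chat completions endpoint. The following is a minimal sketch rather than a confirmed fix. It assumes the separate `langchain-openai` package is installed (it does not appear in the System Info list), and that the served model name equals the path given to `vllm serve`. The helper name `build_chat_model` is hypothetical.

```python
# Connection details taken from the report; the model name is an assumption
# (vLLM is assumed to serve the model under the path passed to `vllm serve`).
VLLM_CHAT_CONFIG = {
    "model": "/mnt/DATA7/MODEL/vllm_model/gguf/Meta-Llama-3.1-8B-Instruct-Q8_0.gguf",
    "api_key": "1234",
    "base_url": "http://localhost:12001/v1",
}


def build_chat_model(config=VLLM_CHAT_CONFIG):
    """Return a chat client for the vLLM server (hypothetical helper)."""
    # Imported lazily so this module loads even if langchain-openai is missing.
    from langchain_openai import ChatOpenAI

    return ChatOpenAI(
        model=config["model"],
        api_key=config["api_key"],
        base_url=config["base_url"],
        temperature=0,
        max_tokens=1000,
    )
```

With the server running, `build_chat_model().invoke("What is the capital of France ?").content` should produce a `POST /v1/chat/completions` entry in the vLLM log. If the reply then stops after a single answer, that points at the endpoint choice rather than at the stop token handling.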