[Bug][RL]: Port Conflict · vllm-project/vllm#28498


Labels: bug, good first issue, help wanted, stale, unstale


Description

Your current environment

bug report:

Hello vLLM team, I'm running into a suspicious ZMQ socket bug with my 2P 4D configuration for DeepSeek-V3 (see below). I thought it was caused by reusing the same nodes for many vLLM launches, but it has now also happened on a clean node. It seems like a DP bug of sorts. Please find the logs attached. vllm==0.11.0.

(APIServer pid=670293)   File "XXX/.venv/lib/python3.12/site-packages/vllm/v1/engine/async_llm.py", line 134, in __init__
(APIServer pid=670293)     self.engine_core = EngineCoreClient.make_async_mp_client(
(APIServer pid=670293)                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=670293)   File "XXX/.venv/lib/python3.12/site-packages/vllm/v1/engine/core_client.py", line 101, in make_async_mp_client
(APIServer pid=670293)     return DPLBAsyncMPClient(*client_args)
(APIServer pid=670293)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=670293)   File "XXX/.venv/lib/python3.12/site-packages/vllm/v1/engine/core_client.py", line 1125, in __init__
(APIServer pid=670293)     super().__init__(vllm_config, executor_class, log_stats,
(APIServer pid=670293)   File "XXX/.venv/lib/python3.12/site-packages/vllm/v1/engine/core_client.py", line 975, in __init__
(APIServer pid=670293)     super().__init__(vllm_config, executor_class, log_stats,
(APIServer pid=670293)   File "XXX/.venv/lib/python3.12/site-packages/vllm/v1/engine/core_client.py", line 769, in __init__
(APIServer pid=670293)     super().__init__(
(APIServer pid=670293)   File "XXX/.venv/lib/python3.12/site-packages/vllm/v1/engine/core_client.py", line 466, in __init__
(APIServer pid=670293)     self.resources.output_socket = make_zmq_socket(
(APIServer pid=670293)                                    ^^^^^^^^^^^^^^^^
(APIServer pid=670293)   File "XXX/.venv/lib/python3.12/site-packages/vllm/utils/__init__.py", line 2983, in make_zmq_socket
(APIServer pid=670293)     socket.bind(path)
(APIServer pid=670293)   File "XXX/.venv/lib/python3.12/site-packages/zmq/sugar/socket.py", line 320, in bind
(APIServer pid=670293)     super().bind(addr)
(APIServer pid=670293)   File "zmq/backend/cython/_zmq.py", line 1009, in zmq.backend.cython._zmq.Socket.bind
(APIServer pid=670293)   File "zmq/backend/cython/_zmq.py", line 190, in zmq.backend.cython._zmq._check_rc
(APIServer pid=670293) zmq.error.ZMQError: Address already in use (addr='tcp://slurm-h200-206-017:59251')

🐛 Describe the bug

From Nick:

I think the problem is that each DP worker finds/assigns free ports dynamically and independently, so there is a race condition. I'm not sure of an immediate workaround apart from just re-attempting to start things when this happens. We'll have to look at how to catch the failure and re-find a port if possible (though I have a memory this might be nontrivial).
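The "catch and re-find a port" idea can be sketched with plain TCP sockets. This is only an illustration under stated assumptions: `bind_with_retry` is a hypothetical helper, not a vLLM or pyzmq function, and it uses the standard `socket` module rather than ZMQ. It also ignores one possible source of difficulty (an assumption here, not something confirmed in this thread): if the originally chosen address has already been shared with other processes, they would need to learn the replacement port.

```python
import errno
import socket


def bind_with_retry(host: str, preferred_port: int,
                    max_attempts: int = 5) -> socket.socket:
    """Hypothetical helper: bind to preferred_port, and if it is already
    in use, fall back to a port chosen by the OS."""
    port = preferred_port
    for _ in range(max_attempts):
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            sock.bind((host, port))
            return sock  # caller reads the real port via getsockname()
        except OSError as exc:
            sock.close()
            if exc.errno != errno.EADDRINUSE:
                raise  # unrelated failure, do not mask it
            port = 0  # port 0 asks the OS for any free ephemeral port
    raise RuntimeError(f"could not bind a socket after {max_attempts} attempts")
```

The caller must treat the returned socket's actual port as authoritative, since it may differ from `preferred_port`.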

From Reporter:

Received init message: EngineHandshakeMetadata(addresses=EngineZmqAddresses(inputs=['tcp://slurm-h200-207-083:60613'], outputs=['tcp://slurm-h200-207-083:36865'], coordinator_input='tcp://slurm-h200-207-083:34575', coordinator_output='tcp://slurm-h200-207-083:48025', frontend_stats_publish_address='ipc:///tmp/88ec875f-3de9-46ec-9947-6d1d6573b910'), parallel_config={'data_parallel_master_ip': 'slurm-h200-207-083', 'data_parallel_master_port': 41917, '_data_parallel_master_port_list': [60545, 36835, 47971, 37001], 'data_parallel_size': 32})

I'm looking at the code and I see that all code paths for getting ports eventually go to _get_open_port, and that in _get_open_port there is basically no defence against choosing the same port twice. Can you please confirm my understanding?

_get_open_port in main is here: https://github.com/vllm-project/vllm/blob/main/vllm/utils/network_utils.py#L177

UPD: I imagine the assumption here is that once a code path gets a port, that code path will use it immediately, and thus the port will become busy. That assumption doesn't seem to hold, though.
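The gap between choosing a port and using it can be reproduced in a few lines. The sketch below is a simplified stand-in for the find-then-release pattern, not vLLM's actual `_get_open_port`; the names `pick_free_port` and `demonstrate_race` are hypothetical. It shows that a port reported as free offers no protection once the probing socket is closed.

```python
import socket


def pick_free_port(host: str = "127.0.0.1") -> int:
    """Ask the OS for a free port, then release it immediately."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        probe.bind((host, 0))
        return probe.getsockname()[1]
    # The port is unreserved again as soon as the probe socket closes.


def demonstrate_race(host: str = "127.0.0.1") -> bool:
    """Return True if a late bind fails because someone else took the port."""
    port = pick_free_port(host)

    # Stand-in for any other process that binds the port in the meantime.
    squatter = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    squatter.bind((host, port))

    late = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        late.bind((host, port))  # the delayed "use" of the chosen port
        return False
    except OSError:
        return True  # "Address already in use"
    finally:
        late.close()
        squatter.close()
```

The longer the delay between `pick_free_port` and the real bind, the more likely it is that some other process ends up in the squatter's role.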

Even when all the ports that vLLM chose for itself are unique, I still get the stack trace in this report. I have the following explanation in mind:

1. vLLM chooses the ZMQ ports before launching the engines.
2. Launching the engines takes ~5 minutes.
3. By the time the engines are launched, something else can be listening on one of these ports, for example Ray.

It looks like the right solution is to hold on to the chosen ports immediately after they are chosen.
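One way to hold a port from the moment it is chosen is to keep the bound socket open instead of returning only the port number. The following is a minimal sketch using the standard `socket` module; `reserve_ports` is a hypothetical helper and does not reflect how vLLM is or will be implemented.

```python
import socket


def reserve_ports(count: int, host: str = "127.0.0.1") -> list[socket.socket]:
    """Bind `count` sockets to OS-assigned ports and keep them open.

    While the sockets stay open, no other process can bind these ports,
    and the ports are guaranteed to be distinct from one another.
    """
    held: list[socket.socket] = []
    try:
        for _ in range(count):
            sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
            held.append(sock)
            sock.bind((host, 0))  # port 0: the OS picks a free port
    except OSError:
        for sock in held:
            sock.close()
        raise
    return held
```

The port numbers are available as `[s.getsockname()[1] for s in held]`. A placeholder socket still has to be closed before a ZMQ socket can bind the same port, which reopens a short window, so binding the real socket at selection time would be the tighter variant. pyzmq's `Socket.bind_to_random_port` chooses and binds in a single step and may be a closer fit for that approach.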


Contributor guide

Research direction: Inspect the port selection in vllm/utils/network_utils.py (_get_open_port) and ensure the port is held immediately after selection to avoid race conditions.
Tech stack: python
Domain: backend
Issue type: Bug
Difficulty: 3
Estimated time: Half day
Activity status: Active
Clarity: Mostly clear
Prerequisites: Python, ZMQ
Newbie friendliness: 50