vllm-project/vllm

[Bug][RL]: Port Conflict

Open

#28498 opened on Nov 11, 2025

Labels: bug, good first issue, help wanted

Description

Your current environment

  • bug report:
Hello vLLM team, I'm running into a suspicious ZMQ socket bug with my 2P 4D configuration for DeepSeek-V3 (see below). I thought it was caused by reusing the same nodes for many vLLM launches, but now it has also happened on a clean node. It looks like a DP bug of some sort. Logs are attached. vllm==0.11.0.
(APIServer pid=670293)   File "XXX/.venv/lib/python3.12/site-packages/vllm/v1/engine/async_llm.py", line 134, in __init__
(APIServer pid=670293)     self.engine_core = EngineCoreClient.make_async_mp_client(
(APIServer pid=670293)                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=670293)   File "XXX/.venv/lib/python3.12/site-packages/vllm/v1/engine/core_client.py", line 101, in make_async_mp_client
(APIServer pid=670293)     return DPLBAsyncMPClient(*client_args)
(APIServer pid=670293)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=670293)   File "XXX/.venv/lib/python3.12/site-packages/vllm/v1/engine/core_client.py", line 1125, in __init__
(APIServer pid=670293)     super().__init__(vllm_config, executor_class, log_stats,
(APIServer pid=670293)   File "XXX/.venv/lib/python3.12/site-packages/vllm/v1/engine/core_client.py", line 975, in __init__
(APIServer pid=670293)     super().__init__(vllm_config, executor_class, log_stats,
(APIServer pid=670293)   File "XXX/.venv/lib/python3.12/site-packages/vllm/v1/engine/core_client.py", line 769, in __init__
(APIServer pid=670293)     super().__init__(
(APIServer pid=670293)   File "XXX/.venv/lib/python3.12/site-packages/vllm/v1/engine/core_client.py", line 466, in __init__
(APIServer pid=670293)     self.resources.output_socket = make_zmq_socket(
(APIServer pid=670293)                                    ^^^^^^^^^^^^^^^^
(APIServer pid=670293)   File "XXX/.venv/lib/python3.12/site-packages/vllm/utils/__init__.py", line 2983, in make_zmq_socket
(APIServer pid=670293)     socket.bind(path)
(APIServer pid=670293)   File "XXX/.venv/lib/python3.12/site-packages/zmq/sugar/socket.py", line 320, in bind
(APIServer pid=670293)     super().bind(addr)
(APIServer pid=670293)   File "zmq/backend/cython/_zmq.py", line 1009, in zmq.backend.cython._zmq.Socket.bind
(APIServer pid=670293)   File "zmq/backend/cython/_zmq.py", line 190, in zmq.backend.cython._zmq._check_rc
(APIServer pid=670293) zmq.error.ZMQError: Address already in use (addr='tcp://slurm-h200-206-017:59251')

🐛 Describe the bug

From Nick:

I think the problem is that each DP worker finds and assigns free ports dynamically and independently, so there is a race condition. I'm not sure of an immediate workaround apart from simply retrying the launch when this happens. We'll have to look at how to catch the error and find a new port if possible (though I have a memory that this might be nontrivial).
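To make the "catch and re-find a port" idea concrete, here is a minimal sketch using only the standard library `socket` module rather than ZMQ. The helper name `bind_with_retry` and its parameters are hypothetical and are not part of vLLM; the sketch only shows the general shape of retrying on `EADDRINUSE`.

```python
import errno
import socket


def bind_with_retry(host: str, preferred_port: int,
                    max_attempts: int = 5) -> socket.socket:
    """Bind to preferred_port; on EADDRINUSE fall back to an OS-assigned port.

    Hypothetical helper for illustration, not vLLM code.
    """
    port = preferred_port
    for _ in range(max_attempts):
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            sock.bind((host, port))
            # The caller reads the final port via sock.getsockname()[1].
            return sock
        except OSError as exc:
            sock.close()
            if exc.errno != errno.EADDRINUSE:
                raise  # unrelated failure, do not hide it
            port = 0  # let the kernel pick a port that is free right now
    raise RuntimeError(
        f"could not bind a port on {host} after {max_attempts} attempts")
```

One likely reason this is nontrivial in practice: if the original address has already been handed to other processes, the replacement port would have to be communicated to them as well, which this sketch does not attempt.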

From Reporter:

Received init message: EngineHandshakeMetadata(addresses=EngineZmqAddresses(inputs=['tcp://slurm-h200-207-083:60613'], outputs=['tcp://slurm-h200-207-083:36865'], coordinator_input='tcp://slurm-h200-207-083:34575', coordinator_output='tcp://slurm-h200-207-083:48025', frontend_stats_publish_address='ipc:///tmp/88ec875f-3de9-46ec-9947-6d1d6573b910'), parallel_config={'data_parallel_master_ip': 'slurm-h200-207-083', 'data_parallel_master_port': 41917, '_data_parallel_master_port_list': [60545, 36835, 47971, 37001], 'data_parallel_size': 32})

I'm looking at the code and I see that all code paths for getting ports eventually go through _get_open_port, and that _get_open_port has essentially no defence against choosing the same port twice. Can you please confirm my understanding?

_get_open_port in main is here: https://github.com/vllm-project/vllm/blob/main/vllm/utils/network_utils.py#L177

Update: I imagine the assumption here is that once a code path gets a port, that code path will use it immediately, and thus the port will become busy. That assumption doesn't seem to hold, though.
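The gap between choosing a port and binding it can be reproduced in a few lines. This is a simplified stand-in written for illustration, assuming a "bind to port 0, read the port, close" probe; it is not a copy of vLLM's `_get_open_port`, and `find_free_port` and `demonstrate_race` are hypothetical names.

```python
import errno
import socket


def find_free_port(host: str = "127.0.0.1") -> int:
    """Ask the kernel for a free port, then release it (illustrative only)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        probe.bind((host, 0))
        # The port is free again as soon as `probe` is closed.
        return probe.getsockname()[1]


def demonstrate_race(host: str = "127.0.0.1") -> int:
    """Return the errno seen when the 'owner' finally binds its chosen port."""
    port = find_free_port(host)

    # Something else (simulated in-process here) takes the port during the gap.
    intruder = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    intruder.bind((host, port))
    intruder.listen(1)

    owner = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        owner.bind((host, port))
        return 0  # no conflict
    except OSError as exc:
        return exc.errno
    finally:
        owner.close()
        intruder.close()


if __name__ == "__main__":
    print(demonstrate_race() == errno.EADDRINUSE)
```

The longer the delay between the probe and the real bind, the more room there is for another process to take the port.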

Even where all sockets that vLLM chose for itself are unique, I get the stack trace below. I have the following explanation in mind:

  • vLLM chooses zmq ports before launching the engines
  • launching the engines takes ~5 mins
  • by the time the engines are launched, something else, for example Ray, may already be listening on this port
  • It looks like the right solution is to hold on to the chosen ports immediately after they are chosen.
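As a rough sketch of the "hold on to the chosen ports" idea from the list above, the port can be kept busy by leaving the probing socket bound instead of closing it. The class `PortReservation` is hypothetical and uses plain standard library sockets; how such a reservation would be handed over to a ZMQ socket, or to an engine process that starts minutes later, is left open here.

```python
import socket


class PortReservation:
    """Hold a port by keeping a bound socket open (hypothetical helper)."""

    def __init__(self, host: str = "127.0.0.1") -> None:
        self._sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        # Port 0 asks the kernel for a free port; the socket stays bound,
        # so no other process can bind the same address while we hold it.
        self._sock.bind((host, 0))
        self.host, self.port = self._sock.getsockname()

    def detach(self) -> socket.socket:
        """Hand the still-bound socket to the component that will serve on it."""
        sock, self._sock = self._sock, None
        return sock

    def close(self) -> None:
        """Release the port if it has not been handed over."""
        if self._sock is not None:
            self._sock.close()
            self._sock = None
```

A typical use would be to create the reservation when the port is chosen and call `detach()` only at the point where the server is ready to listen. For ZMQ specifically, binding the real socket early (pyzmq offers `Socket.bind_to_random_port`, which selects and binds in one step) avoids the gap in a similar way, though whether that fits vLLM's startup order is not something this sketch establishes.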
