feature request, help wanted
Description
🚀 The feature, motivation and pitch
vLLM has a multiprocess architecture with:
- API Server --> EngineCore --> [N] Workers
As a result, clean error logging is challenging: the error that surfaces in the API server is often not the root cause, which was raised earlier in a downstream process. Here is an example at startup time:
(vllm) [robertgshaw2-redhat@nm-automation-h100-standalone-1-preserve vllm]$ just launch_cutlass_tensor
VLLM_USE_DEEP_GEMM=0 VLLM_USE_FLASHINFER_MOE_FP8=1 VLLM_FLASHINFER_MOE_BACKEND=throughput chg run --gpus 2 -- vllm serve amd/Mixtral-8x7B-Instruct-v0.1-FP8-KV -tp 2 --port 8002 --max-model-len 8192
Reserved 2 GPU(s): [1 3] for command execution
(APIServer pid=116718) INFO 01-04 14:48:03 [api_server.py:1277] vLLM API server version 0.13.0rc2.dev185+g00a8d7628
(APIServer pid=116718) INFO 01-04 14:48:03 [utils.py:253] non-default args: {'model_tag': 'amd/Mixtral-8x7B-Instruct-v0.1-FP8-KV', 'port': 8002, 'model': 'amd/Mixtral-8x7B-Instruct-v0.1-FP8-KV', 'max_model_len': 8192, 'tensor_parallel_size': 2}
(APIServer pid=116718) INFO 01-04 14:48:04 [model.py:522] Resolved architecture: MixtralForCausalLM
(APIServer pid=116718) INFO 01-04 14:48:04 [model.py:1510] Using max model len 8192
(APIServer pid=116718) WARNING 01-04 14:48:04 [vllm.py:1453] Current vLLM config is not set.
(APIServer pid=116718) INFO 01-04 14:48:04 [scheduler.py:231] Chunked prefill is enabled with max_num_batched_tokens=2048.
(APIServer pid=116718) INFO 01-04 14:48:04 [vllm.py:635] Disabling NCCL for DP synchronization when using async scheduling.
(APIServer pid=116718) INFO 01-04 14:48:04 [vllm.py:640] Asynchronous scheduling is enabled.
(APIServer pid=116718) INFO 01-04 14:48:05 [scheduler.py:231] Chunked prefill is enabled with max_num_batched_tokens=8192.
(EngineCore_DP0 pid=116936) INFO 01-04 14:48:12 [core.py:96] Initializing a V1 LLM engine (v0.13.0rc2.dev185+g00a8d7628) with config: model='amd/Mixtral-8x7B-Instruct-v0.1-FP8-KV', speculative_config=None, tokenizer='amd/Mixtral-8x7B-Instruct-v0.1-FP8-KV', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=auto, tensor_parallel_size=2, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=fp8, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False), seed=0, served_model_name=amd/Mixtral-8x7B-Instruct-v0.1-FP8-KV, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'level': None, 'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::kda_attention', 'vllm::sparse_attn_indexer'], 'compile_mm_encoder': False, 'compile_sizes': [], 'compile_ranges_split_points': [8192], 'inductor_compile_config': 
{'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'eliminate_noops': True, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 512, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False}, 'local_cache_dir': None}
(EngineCore_DP0 pid=116936) WARNING 01-04 14:48:12 [multiproc_executor.py:882] Reducing Torch parallelism from 80 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
INFO 01-04 14:48:20 [parallel_state.py:1214] world_size=2 rank=1 local_rank=1 distributed_init_method=tcp://127.0.0.1:36779 backend=nccl
INFO 01-04 14:48:20 [parallel_state.py:1214] world_size=2 rank=0 local_rank=0 distributed_init_method=tcp://127.0.0.1:36779 backend=nccl
INFO 01-04 14:48:21 [pynccl.py:111] vLLM is using nccl==2.27.5
INFO 01-04 14:48:23 [parallel_state.py:1425] rank 1 in world size 2 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 1, EP rank 1
INFO 01-04 14:48:23 [parallel_state.py:1425] rank 0 in world size 2 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank 0
(Worker_TP0 pid=117124) INFO 01-04 14:48:24 [gpu_model_runner.py:3762] Starting to load model amd/Mixtral-8x7B-Instruct-v0.1-FP8-KV...
(Worker_TP1 pid=117125) INFO 01-04 14:48:24 [fp8.py:157] Using FlashInfer FP8 MoE CUTLASS backend for SM90/SM100
(Worker_TP0 pid=117124) INFO 01-04 14:48:25 [cuda.py:351] Using FLASH_ATTN attention backend out of potential backends: ('FLASH_ATTN', 'FLASHINFER', 'TRITON_ATTN', 'FLEX_ATTENTION')
(Worker_TP0 pid=117124) INFO 01-04 14:48:25 [fp8.py:157] Using FlashInfer FP8 MoE CUTLASS backend for SM90/SM100
(Worker_TP1 pid=117125) ERROR 01-04 14:48:25 [multiproc_executor.py:751] WorkerProc failed to start.
(Worker_TP1 pid=117125) ERROR 01-04 14:48:25 [multiproc_executor.py:751] Traceback (most recent call last):
(Worker_TP1 pid=117125) ERROR 01-04 14:48:25 [multiproc_executor.py:751] File "/home/robertgshaw2-redhat/vllm/vllm/v1/executor/multiproc_executor.py", line 722, in worker_main
(Worker_TP1 pid=117125) ERROR 01-04 14:48:25 [multiproc_executor.py:751] worker = WorkerProc(*args, **kwargs)
(Worker_TP1 pid=117125) ERROR 01-04 14:48:25 [multiproc_executor.py:751] ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=117125) ERROR 01-04 14:48:25 [multiproc_executor.py:751] File "/home/robertgshaw2-redhat/vllm/vllm/v1/executor/multiproc_executor.py", line 562, in __init__
(Worker_TP1 pid=117125) ERROR 01-04 14:48:25 [multiproc_executor.py:751] self.worker.load_model()
(Worker_TP1 pid=117125) ERROR 01-04 14:48:25 [multiproc_executor.py:751] File "/home/robertgshaw2-redhat/vllm/vllm/v1/worker/gpu_worker.py", line 275, in load_model
(Worker_TP1 pid=117125) ERROR 01-04 14:48:25 [multiproc_executor.py:751] self.model_runner.load_model(eep_scale_up=eep_scale_up)
(Worker_TP1 pid=117125) ERROR 01-04 14:48:25 [multiproc_executor.py:751] File "/home/robertgshaw2-redhat/vllm/vllm/v1/worker/gpu_model_runner.py", line 3781, in load_model
(Worker_TP1 pid=117125) ERROR 01-04 14:48:25 [multiproc_executor.py:751] self.model = model_loader.load_model(
(Worker_TP1 pid=117125) ERROR 01-04 14:48:25 [multiproc_executor.py:751] ^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=117125) ERROR 01-04 14:48:25 [multiproc_executor.py:751] File "/home/robertgshaw2-redhat/vllm/vllm/model_executor/model_loader/base_loader.py", line 49, in load_model
(Worker_TP1 pid=117125) ERROR 01-04 14:48:25 [multiproc_executor.py:751] model = initialize_model(
(Worker_TP1 pid=117125) ERROR 01-04 14:48:25 [multiproc_executor.py:751] ^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=117125) ERROR 01-04 14:48:25 [multiproc_executor.py:751] File "/home/robertgshaw2-redhat/vllm/vllm/model_executor/model_loader/utils.py", line 48, in initialize_model
(Worker_TP1 pid=117125) ERROR 01-04 14:48:25 [multiproc_executor.py:751] return model_class(vllm_config=vllm_config, prefix=prefix)
(Worker_TP1 pid=117125) ERROR 01-04 14:48:25 [multiproc_executor.py:751] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=117125) ERROR 01-04 14:48:25 [multiproc_executor.py:751] File "/home/robertgshaw2-redhat/vllm/vllm/model_executor/models/mixtral.py", line 508, in __init__
(Worker_TP1 pid=117125) ERROR 01-04 14:48:25 [multiproc_executor.py:751] self.model = MixtralModel(
(Worker_TP1 pid=117125) ERROR 01-04 14:48:25 [multiproc_executor.py:751] ^^^^^^^^^^^^^
(Worker_TP1 pid=117125) ERROR 01-04 14:48:25 [multiproc_executor.py:751] File "/home/robertgshaw2-redhat/vllm/vllm/compilation/decorators.py", line 291, in __init__
(Worker_TP1 pid=117125) ERROR 01-04 14:48:25 [multiproc_executor.py:751] old_init(self, **kwargs)
(Worker_TP1 pid=117125) ERROR 01-04 14:48:25 [multiproc_executor.py:751] File "/home/robertgshaw2-redhat/vllm/vllm/model_executor/models/mixtral.py", line 319, in __init__
(Worker_TP1 pid=117125) ERROR 01-04 14:48:25 [multiproc_executor.py:751] self.start_layer, self.end_layer, self.layers = make_layers(
(Worker_TP1 pid=117125) ERROR 01-04 14:48:25 [multiproc_executor.py:751] ^^^^^^^^^^^^
(Worker_TP1 pid=117125) ERROR 01-04 14:48:25 [multiproc_executor.py:751] File "/home/robertgshaw2-redhat/vllm/vllm/model_executor/models/utils.py", line 606, in make_layers
(Worker_TP1 pid=117125) ERROR 01-04 14:48:25 [multiproc_executor.py:751] maybe_offload_to_cpu(layer_fn(prefix=f"{prefix}.{idx}"))
(Worker_TP1 pid=117125) ERROR 01-04 14:48:25 [multiproc_executor.py:751] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=117125) ERROR 01-04 14:48:25 [multiproc_executor.py:751] File "/home/robertgshaw2-redhat/vllm/vllm/model_executor/models/mixtral.py", line 321, in <lambda>
(Worker_TP1 pid=117125) ERROR 01-04 14:48:25 [multiproc_executor.py:751] lambda prefix: MixtralDecoderLayer(
(Worker_TP1 pid=117125) ERROR 01-04 14:48:25 [multiproc_executor.py:751] ^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=117125) ERROR 01-04 14:48:25 [multiproc_executor.py:751] File "/home/robertgshaw2-redhat/vllm/vllm/model_executor/models/mixtral.py", line 257, in __init__
(Worker_TP1 pid=117125) ERROR 01-04 14:48:25 [multiproc_executor.py:751] self.block_sparse_moe = MixtralMoE(
(Worker_TP1 pid=117125) ERROR 01-04 14:48:25 [multiproc_executor.py:751] ^^^^^^^^^^^
(Worker_TP1 pid=117125) ERROR 01-04 14:48:25 [multiproc_executor.py:751] File "/home/robertgshaw2-redhat/vllm/vllm/model_executor/models/mixtral.py", line 129, in __init__
(Worker_TP1 pid=117125) ERROR 01-04 14:48:25 [multiproc_executor.py:751] self.experts = FusedMoE(
(Worker_TP1 pid=117125) ERROR 01-04 14:48:25 [multiproc_executor.py:751] ^^^^^^^^^
(Worker_TP1 pid=117125) ERROR 01-04 14:48:25 [multiproc_executor.py:751] File "/home/robertgshaw2-redhat/vllm/vllm/model_executor/layers/fused_moe/layer.py", line 588, in __init__
(Worker_TP1 pid=117125) ERROR 01-04 14:48:25 [multiproc_executor.py:751] self.quant_method: FusedMoEMethodBase = _get_quant_method()
(Worker_TP1 pid=117125) ERROR 01-04 14:48:25 [multiproc_executor.py:751] ^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=117125) ERROR 01-04 14:48:25 [multiproc_executor.py:751] File "/home/robertgshaw2-redhat/vllm/vllm/model_executor/layers/fused_moe/layer.py", line 580, in _get_quant_method
(Worker_TP1 pid=117125) ERROR 01-04 14:48:25 [multiproc_executor.py:751] quant_method = self.quant_config.get_quant_method(self, prefix)
(Worker_TP1 pid=117125) ERROR 01-04 14:48:25 [multiproc_executor.py:751] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=117125) ERROR 01-04 14:48:25 [multiproc_executor.py:751] File "/home/robertgshaw2-redhat/vllm/vllm/model_executor/layers/quantization/fp8.py", line 347, in get_quant_method
(Worker_TP1 pid=117125) ERROR 01-04 14:48:25 [multiproc_executor.py:751] moe_quant_method = Fp8MoEMethod(self, layer)
(Worker_TP1 pid=117125) ERROR 01-04 14:48:25 [multiproc_executor.py:751] ^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=117125) ERROR 01-04 14:48:25 [multiproc_executor.py:751] File "/home/robertgshaw2-redhat/vllm/vllm/model_executor/layers/quantization/fp8.py", line 762, in __init__
(Worker_TP1 pid=117125) ERROR 01-04 14:48:25 [multiproc_executor.py:751] raise NotImplementedError(
(Worker_TP1 pid=117125) ERROR 01-04 14:48:25 [multiproc_executor.py:751] NotImplementedError: FlashInfer CUTLASS FP8 MoE backend does custom routing function or renormalization, but got True and None.
(Worker_TP1 pid=117125) INFO 01-04 14:48:25 [multiproc_executor.py:709] Parent process exited, terminating worker
(Worker_TP0 pid=117124) INFO 01-04 14:48:25 [multiproc_executor.py:709] Parent process exited, terminating worker
[rank0]:[W104 14:48:25.532969573 ProcessGroupNCCL.cpp:1524] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
(EngineCore_DP0 pid=116936) ERROR 01-04 14:48:27 [core.py:895] EngineCore failed to start.
(EngineCore_DP0 pid=116936) ERROR 01-04 14:48:27 [core.py:895] Traceback (most recent call last):
(EngineCore_DP0 pid=116936) ERROR 01-04 14:48:27 [core.py:895] File "/home/robertgshaw2-redhat/vllm/vllm/v1/engine/core.py", line 886, in run_engine_core
(EngineCore_DP0 pid=116936) ERROR 01-04 14:48:27 [core.py:895] engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
(EngineCore_DP0 pid=116936) ERROR 01-04 14:48:27 [core.py:895] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=116936) ERROR 01-04 14:48:27 [core.py:895] File "/home/robertgshaw2-redhat/vllm/vllm/v1/engine/core.py", line 651, in __init__
(EngineCore_DP0 pid=116936) ERROR 01-04 14:48:27 [core.py:895] super().__init__(
(EngineCore_DP0 pid=116936) ERROR 01-04 14:48:27 [core.py:895] File "/home/robertgshaw2-redhat/vllm/vllm/v1/engine/core.py", line 105, in __init__
(EngineCore_DP0 pid=116936) ERROR 01-04 14:48:27 [core.py:895] self.model_executor = executor_class(vllm_config)
(EngineCore_DP0 pid=116936) ERROR 01-04 14:48:27 [core.py:895] ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=116936) ERROR 01-04 14:48:27 [core.py:895] File "/home/robertgshaw2-redhat/vllm/vllm/v1/executor/multiproc_executor.py", line 97, in __init__
(EngineCore_DP0 pid=116936) ERROR 01-04 14:48:27 [core.py:895] super().__init__(vllm_config)
(EngineCore_DP0 pid=116936) ERROR 01-04 14:48:27 [core.py:895] File "/home/robertgshaw2-redhat/vllm/vllm/v1/executor/abstract.py", line 101, in __init__
(EngineCore_DP0 pid=116936) ERROR 01-04 14:48:27 [core.py:895] self._init_executor()
(EngineCore_DP0 pid=116936) ERROR 01-04 14:48:27 [core.py:895] File "/home/robertgshaw2-redhat/vllm/vllm/v1/executor/multiproc_executor.py", line 172, in _init_executor
(EngineCore_DP0 pid=116936) ERROR 01-04 14:48:27 [core.py:895] self.workers = WorkerProc.wait_for_ready(unready_workers)
(EngineCore_DP0 pid=116936) ERROR 01-04 14:48:27 [core.py:895] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=116936) ERROR 01-04 14:48:27 [core.py:895] File "/home/robertgshaw2-redhat/vllm/vllm/v1/executor/multiproc_executor.py", line 660, in wait_for_ready
(EngineCore_DP0 pid=116936) ERROR 01-04 14:48:27 [core.py:895] raise e from None
(EngineCore_DP0 pid=116936) ERROR 01-04 14:48:27 [core.py:895] Exception: WorkerProc initialization failed due to an exception in a background process. See stack trace for root cause.
(EngineCore_DP0 pid=116936) Process EngineCore_DP0:
(EngineCore_DP0 pid=116936) Traceback (most recent call last):
(EngineCore_DP0 pid=116936) File "/usr/lib64/python3.12/multiprocessing/process.py", line 314, in _bootstrap
(EngineCore_DP0 pid=116936) self.run()
(EngineCore_DP0 pid=116936) File "/usr/lib64/python3.12/multiprocessing/process.py", line 108, in run
(EngineCore_DP0 pid=116936) self._target(*self._args, **self._kwargs)
(EngineCore_DP0 pid=116936) File "/home/robertgshaw2-redhat/vllm/vllm/v1/engine/core.py", line 899, in run_engine_core
(EngineCore_DP0 pid=116936) raise e
(EngineCore_DP0 pid=116936) File "/home/robertgshaw2-redhat/vllm/vllm/v1/engine/core.py", line 886, in run_engine_core
(EngineCore_DP0 pid=116936) engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
(EngineCore_DP0 pid=116936) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=116936) File "/home/robertgshaw2-redhat/vllm/vllm/v1/engine/core.py", line 651, in __init__
(EngineCore_DP0 pid=116936) super().__init__(
(EngineCore_DP0 pid=116936) File "/home/robertgshaw2-redhat/vllm/vllm/v1/engine/core.py", line 105, in __init__
(EngineCore_DP0 pid=116936) self.model_executor = executor_class(vllm_config)
(EngineCore_DP0 pid=116936) ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=116936) File "/home/robertgshaw2-redhat/vllm/vllm/v1/executor/multiproc_executor.py", line 97, in __init__
(EngineCore_DP0 pid=116936) super().__init__(vllm_config)
(EngineCore_DP0 pid=116936) File "/home/robertgshaw2-redhat/vllm/vllm/v1/executor/abstract.py", line 101, in __init__
(EngineCore_DP0 pid=116936) self._init_executor()
(EngineCore_DP0 pid=116936) File "/home/robertgshaw2-redhat/vllm/vllm/v1/executor/multiproc_executor.py", line 172, in _init_executor
(EngineCore_DP0 pid=116936) self.workers = WorkerProc.wait_for_ready(unready_workers)
(EngineCore_DP0 pid=116936) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=116936) File "/home/robertgshaw2-redhat/vllm/vllm/v1/executor/multiproc_executor.py", line 660, in wait_for_ready
(EngineCore_DP0 pid=116936) raise e from None
(EngineCore_DP0 pid=116936) Exception: WorkerProc initialization failed due to an exception in a background process. See stack trace for root cause.
(APIServer pid=116718) Traceback (most recent call last):
(APIServer pid=116718) File "/home/robertgshaw2-redhat/vllm/.venv/bin/vllm", line 10, in <module>
(APIServer pid=116718) sys.exit(main())
(APIServer pid=116718) ^^^^^^
(APIServer pid=116718) File "/home/robertgshaw2-redhat/vllm/vllm/entrypoints/cli/main.py", line 73, in main
(APIServer pid=116718) args.dispatch_function(args)
(APIServer pid=116718) File "/home/robertgshaw2-redhat/vllm/vllm/entrypoints/cli/serve.py", line 60, in cmd
(APIServer pid=116718) uvloop.run(run_server(args))
(APIServer pid=116718) File "/home/robertgshaw2-redhat/vllm/.venv/lib64/python3.12/site-packages/uvloop/__init__.py", line 96, in run
(APIServer pid=116718) return __asyncio.run(
(APIServer pid=116718) ^^^^^^^^^^^^^^
(APIServer pid=116718) File "/usr/lib64/python3.12/asyncio/runners.py", line 195, in run
(APIServer pid=116718) return runner.run(main)
(APIServer pid=116718) ^^^^^^^^^^^^^^^^
(APIServer pid=116718) File "/usr/lib64/python3.12/asyncio/runners.py", line 118, in run
(APIServer pid=116718) return self._loop.run_until_complete(task)
(APIServer pid=116718) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=116718) File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
(APIServer pid=116718) File "/home/robertgshaw2-redhat/vllm/.venv/lib64/python3.12/site-packages/uvloop/__init__.py", line 48, in wrapper
(APIServer pid=116718) return await main
(APIServer pid=116718) ^^^^^^^^^^
(APIServer pid=116718) File "/home/robertgshaw2-redhat/vllm/vllm/entrypoints/openai/api_server.py", line 1324, in run_server
(APIServer pid=116718) await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
(APIServer pid=116718) File "/home/robertgshaw2-redhat/vllm/vllm/entrypoints/openai/api_server.py", line 1343, in run_server_worker
(APIServer pid=116718) async with build_async_engine_client(
(APIServer pid=116718) ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=116718) File "/usr/lib64/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=116718) return await anext(self.gen)
(APIServer pid=116718) ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=116718) File "/home/robertgshaw2-redhat/vllm/vllm/entrypoints/openai/api_server.py", line 171, in build_async_engine_client
(APIServer pid=116718) async with build_async_engine_client_from_engine_args(
(APIServer pid=116718) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=116718) File "/usr/lib64/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=116718) return await anext(self.gen)
(APIServer pid=116718) ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=116718) File "/home/robertgshaw2-redhat/vllm/vllm/entrypoints/openai/api_server.py", line 212, in build_async_engine_client_from_engine_args
(APIServer pid=116718) async_llm = AsyncLLM.from_vllm_config(
(APIServer pid=116718) ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=116718) File "/home/robertgshaw2-redhat/vllm/vllm/v1/engine/async_llm.py", line 207, in from_vllm_config
(APIServer pid=116718) return cls(
(APIServer pid=116718) ^^^^
(APIServer pid=116718) File "/home/robertgshaw2-redhat/vllm/vllm/v1/engine/async_llm.py", line 134, in __init__
(APIServer pid=116718) self.engine_core = EngineCoreClient.make_async_mp_client(
(APIServer pid=116718) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=116718) File "/home/robertgshaw2-redhat/vllm/vllm/v1/engine/core_client.py", line 122, in make_async_mp_client
(APIServer pid=116718) return AsyncMPClient(*client_args)
(APIServer pid=116718) ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=116718) File "/home/robertgshaw2-redhat/vllm/vllm/v1/engine/core_client.py", line 824, in __init__
(APIServer pid=116718) super().__init__(
(APIServer pid=116718) File "/home/robertgshaw2-redhat/vllm/vllm/v1/engine/core_client.py", line 479, in __init__
(APIServer pid=116718) with launch_core_engines(vllm_config, executor_class, log_stats) as (
(APIServer pid=116718) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=116718) File "/usr/lib64/python3.12/contextlib.py", line 144, in __exit__
(APIServer pid=116718) next(self.gen)
(APIServer pid=116718) File "/home/robertgshaw2-redhat/vllm/vllm/v1/engine/utils.py", line 921, in launch_core_engines
(APIServer pid=116718) wait_for_engine_startup(
(APIServer pid=116718) File "/home/robertgshaw2-redhat/vllm/vllm/v1/engine/utils.py", line 980, in wait_for_engine_startup
(APIServer pid=116718) raise RuntimeError(
(APIServer pid=116718) RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}
/usr/lib64/python3.12/multiprocessing/resource_tracker.py:279: UserWarning: resource_tracker: There appear to be 3 leaked shared_memory objects to clean up at shutdown
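The root cause is clear (the kernel does not support the model), but it's very hard to see: the NotImplementedError raised in Worker_TP1 is buried above the later EngineCore and APIServer tracebacks, which only say to look for the root cause elsewhere.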
We need a system for proper error and crash logging.
WARNING: this is not a small project. I'm not interested in a band-aid for the specific issue listed above, but rather in a framework for proper root-cause error logging.
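To make the request more concrete, here is a minimal sketch of one possible shape for such a framework. It is an illustration only, not existing vLLM code: `ErrorReport`, `capture`, and `summarize` are hypothetical names. The idea is that each process serializes its failure into a structured report, attaches any report it received from a child, and the top-level process prints the innermost report first. The transport between processes (pipe, queue, ZMQ socket) is left out and is assumed to carry the JSON string unchanged.

```python
import json
import os
import traceback
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class ErrorReport:
    """Serializable description of a failure in one process (hypothetical)."""

    process: str    # logical process name, e.g. "Worker_TP1"
    pid: int
    exc_type: str
    message: str
    traceback: str
    cause: Optional["ErrorReport"] = None  # report received from a child process

    def root(self) -> "ErrorReport":
        """Follow the cause chain down to the original failure."""
        report = self
        while report.cause is not None:
            report = report.cause
        return report

    def to_json(self) -> str:
        # asdict() recurses into the nested `cause` report.
        return json.dumps(asdict(self))

    @staticmethod
    def from_json(payload: str) -> "ErrorReport":
        def build(data: dict) -> "ErrorReport":
            cause = data.pop("cause")
            return ErrorReport(**data, cause=build(cause) if cause else None)

        return build(json.loads(payload))


def capture(process: str, exc: BaseException,
            cause: Optional[ErrorReport] = None) -> ErrorReport:
    """Turn a caught exception into a report, linking a child's report if any."""
    return ErrorReport(
        process=process,
        pid=os.getpid(),
        exc_type=type(exc).__name__,
        message=str(exc),
        traceback="".join(
            traceback.format_exception(type(exc), exc, exc.__traceback__)
        ),
        cause=cause,
    )


def summarize(report: ErrorReport) -> str:
    """Render the root cause first, then the path it propagated along."""
    chain = []
    current: Optional[ErrorReport] = report
    while current is not None:
        chain.append(current.process)
        current = current.cause
    root = report.root()
    return (
        f"ROOT CAUSE in {root.process} (pid={root.pid}): "
        f"{root.exc_type}: {root.message}\n"
        f"Propagated via: {' <- '.join(chain)}"
    )
```

In this sketch a worker would call `capture("Worker_TP1", exc).to_json()` and send the result to its parent. The EngineCore would rebuild it with `ErrorReport.from_json`, pass it as `cause` to its own `capture` call, and forward the combined report to the API server, which would print `summarize(report)` as the final lines of output instead of a generic "see root cause above" message.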
Thanks!