KeyError: 'model.layers.14.mlp.shared_expert.gate_gate_up_proj.weight'
Opened on Nov 13, 2025
Description
Checklist
- 1. I have searched related issues but cannot get the expected help.
- 2. The bug has not been fixed in the latest version.
- 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
- 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose. Otherwise, it will be closed.
- 5. Please use English, otherwise it will be closed.
Describe the bug
🐛 Bug: KeyError when loading Qwen3-Next-80B-A3B-Instruct-AWQ-4bit (MoE model)
I am encountering a KeyError when attempting to launch the SGLang server using the latest Docker image (lmsysorg/sglang:latest) to load the Qwen3-Next-80B-A3B-Instruct-AWQ-4bit model. The model is an AWQ-quantized Mixture of Experts (MoE) architecture.
The error occurs during the model weight loading phase, in load_weights in SGLang's Qwen3-Next model implementation (qwen3_next.py). The loader looks up a parameter name that does not exist in the model's parameter dictionary, which suggests a mismatch between the names SGLang derives and the key names in the provided checkpoint.
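The doubled `gate_` in the missing key hints at one possible mechanism, which I have not verified against SGLang's source: if a loader fuses `gate_proj` and `up_proj` into `gate_up_proj` by plain substring replacement, a checkpoint name that already contains `gate_up_proj` would be rewritten a second time. The sketch below is illustrative only; `STACKED_MAPPING` and `remap` are hypothetical names, not SGLang's API.

```python
# Hypothetical illustration; not copied from SGLang's qwen3_next.py.
# Each entry is (fused parameter name, per-shard checkpoint name).
STACKED_MAPPING = [
    ("gate_up_proj", "gate_proj"),
    ("gate_up_proj", "up_proj"),
]


def remap(name: str) -> str:
    """Rewrite the first matching shard name to its fused name by substring replacement."""
    for fused, shard in STACKED_MAPPING:
        if shard in name:
            return name.replace(shard, fused)
    return name


# A per-shard name is remapped as intended.
print(remap("model.layers.14.mlp.shared_expert.up_proj.weight"))
# A name that is already fused still contains "up_proj" and is rewritten again.
print(remap("model.layers.14.mlp.shared_expert.gate_up_proj.weight"))
```

If this is what happens, the second call reproduces the exact key from the traceback, which would point to the checkpoint storing an already-fused `gate_up_proj` tensor for the shared expert. This remains a guess until the checkpoint's key names are inspected.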
Steps to Reproduce
- Model Used: The local model folder contains the weights for cpatonn/Qwen3-Next-80B-A3B-Instruct-AWQ-4bit.
- Environment: The server is launched in a Docker container with an RTX Pro 6000 Blackwell GPU (though the specific GPU is likely irrelevant to the KeyError).
- Launch Command: The following bash command is executed:
docker run \
--name sglang-qwen-80b \
--gpus '"device=2"' \
--shm-size 96g \
--runtime nvidia \
--ipc=host \
-p 30000:30000 \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-v /path/to/your/model/Qwen3-Next-80B-A3B-Instruct-AWQ-4bit:/model \
-e CUDA_VISIBLE_DEVICES=0 \
-e SGLANG_ALLOW_OVERWRITE_LONGER_CONTEXT_LEN=1 \
lmsysorg/sglang:latest \
python3 -m sglang.launch_server \
--model-path /model \
--host 0.0.0.0 \
--port 30000 \
--tp-size 1 \
--context-length 262144 \
--mem-fraction-static 0.8 \
--trust-remote-code
Expected Behavior
The SGLang server should successfully load the model weights, initialize the runtime, and start listening on port 30000.
Actual Behavior (Error Log)
The process fails immediately with a KeyError during model loading:
Loading safetensors checkpoint shards: 0% Completed | 0/10 [00:00<?, ?it/s]
[2025-11-13 16:21:00] Scheduler hit an exception: Traceback (most recent call last):
File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 2802, in run_scheduler_process
scheduler = Scheduler(
File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 311, in __init__
self.tp_worker = TpModelWorker(
File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker.py", line 237, in __init__
self._model_runner = ModelRunner(
File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 322, in __init__
self.initialize(min_per_gpu_memory)
File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 398, in initialize
self.load_model()
File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 752, in load_model
self.model = get_model(
File "/sgl-workspace/sglang/python/sglang/srt/model_loader/__init__.py", line 28, in get_model
return loader.load_model(
File "/sgl-workspace/sglang/python/sglang/srt/model_loader/loader.py", line 599, in load_model
self.load_weights_and_postprocess(
File "/sgl-workspace/sglang/python/sglang/srt/model_loader/loader.py", line 607, in load_weights_and_postprocess
model.load_weights(weights)
File "/sgl-workspace/sglang/python/sglang/srt/models/qwen3_next.py", line 1009, in load_weights
param = params_dict[name]
KeyError: 'model.layers.14.mlp.shared_expert.gate_gate_up_proj.weight'
[2025-11-13 16:21:00] Received sigquit from a child process. It usually means the child failed.
Loading safetensors checkpoint shards: 0% Completed | 0/10 [00:00<?, ?it/s]
Additional Context
- The model weights are confirmed to be valid. I successfully launched the exact same model using the vLLM framework on the same machine, confirming the integrity of the model files:
docker run -d --name wanli \
--gpus '"device=2"' --ipc=host \
-p 6000:6000 \
-v ./hf_hub/Qwen3-Next-80B-A3B-Instruct-AWQ-4bit:/models:ro \
vllm/vllm-openai:latest \
--model /models --served-model-name cpatonn/Qwen3-Next-80B-A3B-Instruct-AWQ-4bit \
--tensor-parallel-size 1 --host 0.0.0.0 --port 6000 \
--max-model-len 65536 --gpu-memory-utilization 0.90
- The KeyError indicates an incompatibility between SGLang's expected key names for the MoE layers (specifically relating to the shared expert/gate projection) and the actual key names present in the AWQ-quantized Qwen3-Next model weights. This is likely a bug in the MoE weight loading logic within qwen3_next.py for this specific model variant.
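To compare the key names in the checkpoint with the name SGLang fails to find, the shared-expert tensors of layer 14 can be listed from the checkpoint index. This is a minimal sketch, assuming the checkpoint follows the usual Hugging Face sharded layout with a `model.safetensors.index.json` file containing a `weight_map`; `find_keys` and `load_weight_map` are helper names introduced here, and `/model` is the mount point from the launch command.

```python
import json
from pathlib import Path


def find_keys(weight_map: dict, needle: str) -> list:
    """Return the sorted tensor names that contain `needle`."""
    return sorted(k for k in weight_map if needle in k)


def load_weight_map(model_dir: str) -> dict:
    """Read the tensor-name -> shard-file map of a sharded safetensors checkpoint."""
    index_path = Path(model_dir) / "model.safetensors.index.json"
    with index_path.open() as f:
        return json.load(f)["weight_map"]


if __name__ == "__main__":
    # Assumption: the checkpoint is mounted at /model, as in the launch command.
    if (Path("/model") / "model.safetensors.index.json").exists():
        weight_map = load_weight_map("/model")
        for key in find_keys(weight_map, "layers.14.mlp.shared_expert"):
            print(key)
```

The printed names show whether the checkpoint stores separate `gate_proj`/`up_proj` tensors or some other naming for the shared expert, which is the information needed to pin down the mismatch.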
Reproduction
docker run \
--name sglang-qwen-80b \
--gpus '"device=2"' \
--shm-size 96g \
--runtime nvidia \
--ipc=host \
-p 30000:30000 \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-v /path/to/your/model/Qwen3-Next-80B-A3B-Instruct-AWQ-4bit:/model \
-e CUDA_VISIBLE_DEVICES=0 \
-e SGLANG_ALLOW_OVERWRITE_LONGER_CONTEXT_LEN=1 \
lmsysorg/sglang:latest \
python3 -m sglang.launch_server \
--model-path /model \
--host 0.0.0.0 \
--port 30000 \
--tp-size 1 \
--context-length 262144 \
--mem-fraction-static 0.8 \
--trust-remote-code