KeyError: 'model.layers.14.mlp.shared_expert.gate_gate_up_proj.weight' · sgl-project/sglang#13214

Repository metrics

Stars: 28,442
PR merge metrics: average merge time 2d 1h; 1000 PRs merged in 30 days

Description

Checklist

1. I have searched related issues but cannot get the expected help.
2. The bug has not been fixed in the latest version.
3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose Otherwise, it will be closed.
5. Please use English, otherwise it will be closed.

Describe the bug

🐛 Bug: `KeyError` when loading Qwen3-Next-80B-A3B-Instruct-AWQ-4bit (MoE model)

Describe the Bug

I am encountering a KeyError when attempting to launch the SGLang server using the latest Docker image (lmsysorg/sglang:latest) to load the Qwen3-Next-80B-A3B-Instruct-AWQ-4bit model. The model is an AWQ-quantized Mixture of Experts (MoE) architecture.

The error occurs during the model weight loading phase, specifically when load_weights in SGLang's model implementation (qwen3_next.py) looks up a weight name that is not present in the model's parameter dictionary. The name appears to be mismatched with the actual MoE structure of the provided checkpoint.

Steps to Reproduce

1. Model Used: The local model folder contains the weights for cpatonn/Qwen3-Next-80B-A3B-Instruct-AWQ-4bit.
2. Environment: The server is launched in a Docker container with an RTX Pro 6000 Blackwell GPU (though the specific GPU is likely irrelevant to the KeyError).
3. Launch Command: The following bash command is executed:

docker run \
  --name sglang-qwen-80b \
  --gpus '"device=2"' \
  --shm-size 96g \
  --runtime nvidia \
  --ipc=host \
  -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -v /path/to/your/model/Qwen3-Next-80B-A3B-Instruct-AWQ-4bit:/model \
  -e CUDA_VISIBLE_DEVICES=0 \
  -e SGLANG_ALLOW_OVERWRITE_LONGER_CONTEXT_LEN=1 \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server \
  --model-path /model \
  --host 0.0.0.0 \
  --port 30000 \
  --tp-size 1 \
  --context-length 262144 \
  --mem-fraction-static 0.8 \
  --trust-remote-code

Expected Behavior

The SGLang server should successfully load the model weights, initialize the runtime, and start listening on port 30000.

Actual Behavior (Error Log)

The process fails immediately with a KeyError during model loading:

Loading safetensors checkpoint shards:    0% Completed | 0/10 [00:00<?, ?it/s]
[2025-11-13 16:21:00] Scheduler hit an exception: Traceback (most recent call last):
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 2802, in run_scheduler_process
    scheduler = Scheduler(
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 311, in __init__
    self.tp_worker = TpModelWorker(
  File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker.py", line 237, in __init__
    self._model_runner = ModelRunner(
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 322, in __init__
    self.initialize(min_per_gpu_memory)
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 398, in initialize
    self.load_model()
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 752, in load_model
    self.model = get_model(
  File "/sgl-workspace/sglang/python/sglang/srt/model_loader/__init__.py", line 28, in get_model
    return loader.load_model(
  File "/sgl-workspace/sglang/python/sglang/srt/model_loader/loader.py", line 599, in load_model
    self.load_weights_and_postprocess(
  File "/sgl-workspace/sglang/python/sglang/srt/model_loader/loader.py", line 607, in load_weights_and_postprocess
    model.load_weights(weights)
  File "/sgl-workspace/sglang/python/sglang/srt/models/qwen3_next.py", line 1009, in load_weights
    param = params_dict[name]
KeyError: 'model.layers.14.mlp.shared_expert.gate_gate_up_proj.weight'

[2025-11-13 16:21:00] Received sigquit from a child process. It usually means the child failed.
Loading safetensors checkpoint shards:    0% Completed | 0/10 [00:00<?, ?it/s]

Additional Context

The model weights are confirmed to be valid. I successfully launched the exact same model using the vLLM framework on the same machine, confirming the integrity of the model files:

docker run -d --name wanli \
  --gpus '"device=2"' --ipc=host \
  -p 6000:6000 \
  -v ./hf_hub/Qwen3-Next-80B-A3B-Instruct-AWQ-4bit:/models:ro \
  vllm/vllm-openai:latest \
  --model /models --served-model-name cpatonn/Qwen3-Next-80B-A3B-Instruct-AWQ-4bit \
  --tensor-parallel-size 1 --host 0.0.0.0 --port 6000 \
  --max-model-len 65536 --gpu-memory-utilization 0.90

The KeyError indicates an incompatibility between SGLang's expected key names for the MoE layers (specifically relating to the shared expert/gate projection) and the actual key names present in the AWQ-quantized Qwen3-Next model weights. This is likely a bug in the MoE weight loading logic within qwen3_next.py for this specific model variant.
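One possible reading of the doubled `gate_` prefix in the failing key, offered as a hypothesis rather than a confirmed diagnosis: loaders that fuse `gate_proj` and `up_proj` into a single `gate_up_proj` parameter often rename checkpoint keys by substring replacement, and a name that already contains `gate_up_proj` also contains `up_proj`. The sketch below only illustrates that mechanism. The mapping table and both function names are hypothetical and are not copied from qwen3_next.py.

```python
# Hypothetical mapping: (fused parameter name, checkpoint shard name).
# This is an illustration, not the table used by SGLang.
STACKED_MAPPING = [
    ("gate_up_proj", "gate_proj"),
    ("gate_up_proj", "up_proj"),
]


def naive_rename(name: str) -> str:
    """Apply every matching substring rename without stopping at the first hit."""
    for fused_name, shard_name in STACKED_MAPPING:
        if shard_name in name:
            name = name.replace(shard_name, fused_name)
    return name


def guarded_rename(name: str) -> str:
    """Stop after the first matching rename so the result is not rewritten again."""
    for fused_name, shard_name in STACKED_MAPPING:
        if shard_name in name:
            return name.replace(shard_name, fused_name)
    return name


key = "model.layers.14.mlp.shared_expert.gate_proj.weight"
print(naive_rename(key))    # the second rule fires on the already-renamed key
print(guarded_rename(key))  # only the first matching rule is applied
```

If the loader behaves like `naive_rename` for this checkpoint, the lookup in `params_dict` would fail exactly as in the traceback. Whether that is what happens here can only be settled by reading the loop around line 1009 of qwen3_next.py.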

Reproduction

docker run \
  --name sglang-qwen-80b \
  --gpus '"device=2"' \
  --shm-size 96g \
  --runtime nvidia \
  --ipc=host \
  -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -v /path/to/your/model/Qwen3-Next-80B-A3B-Instruct-AWQ-4bit:/model \
  -e CUDA_VISIBLE_DEVICES=0 \
  -e SGLANG_ALLOW_OVERWRITE_LONGER_CONTEXT_LEN=1 \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server \
  --model-path /model \
  --host 0.0.0.0 \
  --port 30000 \
  --tp-size 1 \
  --context-length 262144 \
  --mem-fraction-static 0.8 \
  --trust-remote-code

Contributor guide

Research direction: Inspect the weight-loading function in qwen3_next.py and compare the expected key 'model.layers.14.mlp.shared_expert.gate_gate_up_proj.weight' with the actual keys in the model checkpoint to identify the mismatch.
Tech stack: Python
Domain: backend
Issue type: Bug
Difficulty: 2
Estimated time: 1-3 hours
Task status: Active
Clarity: Clear
Prerequisites: Python
Beginner-friendly: 70
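
The key comparison suggested in the contributor guidance can be started without loading any tensors. Sharded Hugging Face safetensors checkpoints normally ship an index file whose `weight_map` lists every tensor name. The sketch below assumes that file is named `model.safetensors.index.json` and sits in the model folder; the helper name `checkpoint_keys` is made up for this example.

```python
import json
from pathlib import Path


def checkpoint_keys(index_path: str, needle: str) -> list[str]:
    """Return the sorted checkpoint tensor names that contain `needle`."""
    index = json.loads(Path(index_path).read_text())
    # The index maps each tensor name to the shard file that stores it.
    return sorted(name for name in index["weight_map"] if needle in name)
```

Calling `checkpoint_keys("/model/model.safetensors.index.json", "layers.14.mlp.shared_expert")` inside the container (with `/model` being the mount from the launch command) should list the shared-expert tensors for layer 14. Comparing those names with the key in the KeyError shows whether the checkpoint stores separate `gate_proj` and `up_proj` tensors, and whether they end in `.weight` or in AWQ-style suffixes.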