[Bug] Cannot load qwen3-vl series with lora adapter on vllm.
#3,560 opened on November 6, 2025
Description
I fine-tuned the Qwen3-VL-2B-Instruct model using Unsloth.
My code is nearly identical to the official guide; the only change I made was replacing the guide's 8B model with the 2B model for fine-tuning.
After fine-tuning, I confirmed that the QLoRA adapter was saved correctly.
Excited and happy, I moved the saved QLoRA adapter and the Qwen3-VL-2B-Instruct model to my vLLM server.
Then I started model serving with vLLM using the command below. (For reference, the vLLM server itself has no issues: it was already serving the official Qwen3-VL models.)
import sys

command = [
    sys.executable,
    "-m", "vllm.entrypoints.openai.api_server",
    "--model", "./Qwen3-VL-2B-Instruct",
    "--max_model_len", "3500",
    "--gpu_memory_utilization", "0.85",
    "--trust-remote-code",
    "--host", "0.0.0.0",
    "--port", "8888",
    # LoRA adapter options
    "--enable-lora",
    "--max-lora-rank", "16",  # must be >= the adapter's LoRA rank
    "--max-loras", "1",
    "--max-cpu-loras", "1",
    "--lora-modules", "adapter0=./my_lora_adapter",
]
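As a sanity check on the values passed above, the adapter's rank and target modules can be read from the adapter_config.json file that PEFT-style adapters include. The sketch below assumes the adapter directory follows that layout and uses the standard `r`, `target_modules`, and `base_model_name_or_path` fields; `summarize_adapter` is a hypothetical helper name, not a vLLM or Unsloth API. It only reports what the adapter declares and does not by itself explain the loading failure.

```python
import json
from pathlib import Path


def summarize_adapter(adapter_dir, max_lora_rank):
    """Summarize a PEFT-style adapter_config.json (hypothetical helper).

    Assumes the adapter directory contains adapter_config.json with the
    usual PEFT fields; missing fields are reported as None.
    """
    config_path = Path(adapter_dir) / "adapter_config.json"
    config = json.loads(config_path.read_text())
    rank = config.get("r")
    return {
        # Base model the adapter was trained against
        "base_model": config.get("base_model_name_or_path"),
        # LoRA rank declared by the adapter
        "rank": rank,
        # May be a list of module names or a single regex string
        "target_modules": config.get("target_modules"),
        # True only if the rank is present and within the given limit
        "rank_fits": rank is not None and rank <= max_lora_rank,
    }
```

For the setup in this report, the call would be `summarize_adapter("./my_lora_adapter", 16)`, where 16 is the value given to `--max-lora-rank`.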
I waited for vLLM to properly load the QLoRA adapter, but the following problem occurred. This same issue happened even when I retrained LoRA using Unsloth with 2B, 4B, and 8B models.
When I was feeling hopeless, I tried merging the model instead of saving the LoRA adapter separately by using the save_pretrained_merged() function as shown below, and then vLLM was able to load and perform inference normally:
model.save_pretrained_merged("my_16bit_model", tokenizer, save_method="merged_16bit")
However, I don't want to merge the models—I want to load only the LoRA adapter. I’ve seen many posts from others experiencing the same error. As of now, what can I do to resolve this issue?