[Bug] Cannot load qwen3-vl series with lora adapter on vllm. · unslothai/unsloth#3560

Repository metrics

Stars: 64,271
PR merge metrics: average merge time 3d 15h; 525 PRs merged in 30d

Description

I fine-tuned the Qwen3-VL-2B-Instruct model using Unsloth. My code is 99% identical to the official guide; the only change I made was replacing the 8B model in the guide with the 2B model. After fine-tuning, I confirmed that the QLoRA adapter was saved correctly.

Excited and happy, I moved the saved QLoRA adapter and the Qwen3-VL-2B-Instruct model to my vLLM server. Then I ran a command to start model serving with vLLM as shown below. (For reference, the vLLM server has no issues—it was already serving official Qwen3-VL models.)

import sys

# Arguments for launching the vLLM OpenAI-compatible server from the current interpreter
command = [
    sys.executable,
    "-m", "vllm.entrypoints.openai.api_server",
    "--model", "./Qwen3-VL-2B-Instruct",
    "--max_model_len", "3500",
    "--gpu_memory_utilization", "0.85",
    "--trust-remote-code",
    "--host", "0.0.0.0",
    "--port", "8888",

    # for lora adapter
    "--enable-lora",
    "--max-lora-rank", "16",  # LoRA rank
    "--max-loras", "1",
    "--max-cpu-loras", "1",
    "--lora-modules", "adapter0=./my_lora_adapter",
]

I waited for vLLM to properly load the QLoRA adapter, but the following problem occurred. This same issue happened even when I retrained LoRA using Unsloth with 2B, 4B, and 8B models.

When I was feeling hopeless, I tried merging the model instead of saving the LoRA adapter separately by using the save_pretrained_merged() function as shown below, and then vLLM was able to load and perform inference normally:

model.save_pretrained_merged("my_16bit_model", tokenizer, save_method="merged_16bit")

However, I don't want to merge the models—I want to load only the LoRA adapter. I’ve seen many posts from others experiencing the same error. As of now, what can I do to resolve this issue?
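One inexpensive check before digging into vLLM itself is whether the saved adapter's configuration agrees with the serving flags in the command above. The sketch below is only a diagnostic under stated assumptions: it assumes the adapter directory contains a PEFT-style `adapter_config.json` with `r`, `target_modules`, and `base_model_name_or_path` fields, and the helper name `summarize_adapter_config` is hypothetical. It does not reproduce or fix the reported error.

```python
import json
from pathlib import Path


def summarize_adapter_config(adapter_dir, max_lora_rank):
    """Read a PEFT-style adapter_config.json and compare its rank to a limit.

    Hypothetical helper for illustration; not part of Unsloth or vLLM.
    """
    config_path = Path(adapter_dir) / "adapter_config.json"
    config = json.loads(config_path.read_text())
    rank = config.get("r")
    targets = config.get("target_modules")
    return {
        "rank": rank,
        # True only if the adapter rank is within the value given to --max-lora-rank.
        "rank_fits": rank is not None and rank <= max_lora_rank,
        # PEFT allows target_modules to be a list of names or a single regex string.
        "targets_is_regex": isinstance(targets, str),
        "target_modules": targets,
        "base_model": config.get("base_model_name_or_path"),
    }
```

Calling `summarize_adapter_config("./my_lora_adapter", max_lora_rank=16)` with the values from the command would show the adapter's rank, whether it fits under `--max-lora-rank`, and which base model and modules the adapter was trained against.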

Contributor guide

Research direction: Investigate vLLM's LoRA loading logic for multimodal (vision-language) models. Check whether the adapter weights match the model architecture (e.g., Qwen3-VL). Compare the successful merged-model path with the failing LoRA-only path.
Tech stack: Python, PyTorch
Domain: AI
Issue type: Bug
Difficulty: 3
Estimated time: Half a day
Activity status: Active
Clarity: Mostly clear
Prerequisites: Python, Git
Newcomer friendliness: 60
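
To check whether the adapter weights match the model architecture, as the research direction suggests, one possible starting point is to list the tensor names stored in the adapter file and see which parts of the model they target. The sketch below parses only the JSON header of a `.safetensors` file using the standard library. The function names are hypothetical, the file name `adapter_model.safetensors` assumes PEFT's default, and the `"visual"` marker for vision-tower modules is an assumption about Qwen3-VL module naming that should be verified against the actual checkpoint.

```python
import json
import struct
from collections import Counter


def read_safetensors_names(path):
    """Return tensor names from a .safetensors file by parsing only its JSON header."""
    with open(path, "rb") as f:
        # The file starts with an unsigned 64-bit little-endian header length.
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len))
    # "__metadata__" is an optional free-form entry, not a tensor.
    return [name for name in header if name != "__metadata__"]


def count_by_tower(names, vision_marker="visual"):
    """Count tensors whose names contain the assumed vision-tower marker."""
    return Counter("vision" if vision_marker in name else "language" for name in names)
```

For example, `count_by_tower(read_safetensors_names("./my_lora_adapter/adapter_model.safetensors"))` would indicate whether the adapter carries LoRA weights for the vision tower in addition to the language model, which is one difference worth comparing between the merged path and the LoRA-only path.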
