[Bug] Cannot load qwen3-vl series with lora adapter on vllm. · unslothai/unsloth#3560

(5 commenti) (0 reazioni) (0 assegnatari)Python (5658 fork)batch import

good first issue

Metriche repository

Star: (64.271 star)
Metriche merge PR: (Merge medio 3g 15h) (525 PR mergiate in 30 g)

Descrizione

I fine-tuned the Qwen3-VL-8B-Instruct model using Unsloth. My code is 99% identical to the official guide; the only change I made was replacing the 8B model in the guide with the 2B model for fine-tuning. After fine-tuning, I confirmed that the QLoRA adapter was saved correctly.

Excited and happy, I moved the saved QLoRA adapter and the Qwen3-VL-2B-Instruct model to my vLLM server. Then I ran a command to start model serving with vLLM as shown below. (For reference, the vLLM server has no issues—it was already serving official Qwen3-VL models.)

command = [
        sys.executable, 
        "-m", "vllm.entrypoints.openai.api_server",
        "--model", "./Qwen3-VL-2B-Instruct",
        "--max_model_len", "3500",
        "--gpu_memory_utilization", "0.85",
        "--trust-remote-code",
        "--host", "0.0.0.0",
        "--port", "8888",

        # for lora adapter
        "--enable-lora",
        "--max-lora-rank", "16",  # LoRA rank
        "--max-loras", "1", 
        "--max-cpu-loras", "1",
        "--lora-modules", "adapter0=./my_lora_adapter"
]

I waited for vLLM to properly load the QLoRA adapter, but the following problem occurred. This same issue happened even when I retrained LoRA using Unsloth with 2B, 4B, and 8B models.

When I was feeling hopeless, I tried merging the model instead of saving the LoRA adapter separately by using the save_pretrained_merged() function as shown below, and then vLLM was able to load and perform inference normally:

save_pretrained_merged( f"my_16bit_model", tokenizer, save_method="merged_16bit")

However, I don't want to merge the models—I want to load only the LoRA adapter. I’ve seen many posts from others experiencing the same error. As of now, what can I do to resolve this issue?

Guida contributor

Direzione di ricerca: Indaga la logica di caricamento LoRA di vLLM per i modelli multimodali (visione linguaggio). Controlla se i pesi dell'adattatore corrispondono all'architettura del modello (es. Qwen3 VL). Confronta il percorso del modello unito riuscito con quello fallito del solo LoRA.
Tech stack: pythonpytorch
Dominio: ai
Tipo issue: Bug
Difficoltà: 3
Tempo stimato: Mezza giornata
Stato attività: Attiva
Chiarezza: Abbastanza chiara
Prerequisiti: PythonGit
Adatta ai principianti: 60

Metriche repository

Descrizione

Guida contributor

Ricevi issue Easy fresche nella tua inbox.