unslothai/unsloth

[Feature Request] Add Idefics3 architecture support (Granite Docling VLM)

Open

#4079 opened on Feb 19, 2026

help wanted

Description

Feature Request: Idefics3 Architecture Support

Summary

This issue requests native Unsloth support for the Idefics3 architecture. That would enable optimized fine-tuning of models such as IBM Granite Docling VLM (258M parameters), a high-performing document understanding model.

Why This Matters

Granite Docling VLM achieves 87.7 on DocVQA with only 258M parameters (vs. the original Idefics3-8B at 74.0). It's Apache 2.0 licensed and increasingly used for document conversion (PDFs, scans, slides → structured output). Unsloth support would make fine-tuning this model significantly faster and more memory-efficient, opening it up to consumer GPUs.

Architecture Analysis

Granite Docling VLM (and all Idefics3 models) uses Idefics3ForConditionalGeneration. Its components map closely to things Unsloth already supports:

| Component | Idefics3 / Granite Docling | Unsloth Status |
| --- | --- | --- |
| Vision Encoder | SigLIP2-base-patch16-512 | SigLIP supported in other VLMs (LLaVA, etc.) |
| Language Model | Granite 165M (Llama 3-based) | Llama fully supported |
| Connector | Pixel Shuffle projector (4x spatial compression) | Not yet in Unsloth |
| Model Class | Idefics3ForConditionalGeneration | Not registered |
| Config type | model_type = "idefics3" with vision_config + text_config | Would be detected as VLM, but lacks patches |

The language model backbone is Llama-based, and the vision encoder is SigLIP — both already have Unsloth optimizations in other model families. The primary new component is the Pixel Shuffle connector that bridges vision→language.
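To make the connector concrete, here is a minimal pure-Python sketch of the space-to-depth step behind a Pixel Shuffle projector. It assumes a square grid of vision tokens and a scale factor of 2, which gives the 4x token reduction mentioned in the table. The function name `pixel_shuffle` and the channel ordering inside each merged vector are illustrative assumptions; the real implementation in transformers works on tensors and may order channels differently, and it is followed by a linear projection that is omitted here.

```python
import math


def pixel_shuffle(tokens, scale=2):
    """Merge each scale x scale block of vision tokens into one longer token.

    tokens: row-major list of H*W feature vectors (lists of floats), H == W.
    Returns (H/scale)*(W/scale) vectors, each scale*scale times longer.
    Hypothetical helper for illustration only, not Unsloth or transformers API.
    """
    side = math.isqrt(len(tokens))
    if side * side != len(tokens):
        raise ValueError("expected a square grid of tokens")
    if side % scale != 0:
        raise ValueError("grid side must be divisible by scale")

    merged = []
    for block_y in range(0, side, scale):
        for block_x in range(0, side, scale):
            vector = []
            # Concatenate the features of every token inside this block.
            for dy in range(scale):
                for dx in range(scale):
                    vector.extend(tokens[(block_y + dy) * side + (block_x + dx)])
            merged.append(vector)
    return merged


# A 4x4 grid of 1-dimensional tokens becomes a 2x2 grid of 4-dimensional tokens.
grid = [[float(i)] for i in range(16)]
compressed = pixel_shuffle(grid, scale=2)
print(len(grid), "->", len(compressed), "tokens, width", len(compressed[0]))
```

The sequence the language model sees shrinks by `scale ** 2` while the per-token width grows by the same factor, which is why the connector only needs plain linear layers afterwards.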

Desired Scope

Full FastVisionModel support including:

  • SFT via SFTTrainer
  • DPO via DPOTrainer
  • GRPO / GSPO via GRPOTrainer
  • LoRA with selective layer training (finetune_vision_layers, finetune_language_layers, etc.)
  • 4-bit quantization via load_in_4bit
  • Unsloth gradient checkpointing (use_gradient_checkpointing="unsloth")
  • Fast inference via vLLM integration (fast_inference=True)

Ideal Usage

from unsloth import FastVisionModel  # import unsloth before trl so its patches apply
from trl import SFTTrainer

# Load Granite Docling VLM with Unsloth optimizations
model, tokenizer = FastVisionModel.from_pretrained(
    model_name="ibm-granite/granite-docling-258M",
    max_seq_length=2048,
    load_in_4bit=True,
    use_gradient_checkpointing="unsloth",
)

# Apply LoRA
model = FastVisionModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    target_modules="all-linear",
    finetune_vision_layers=False,
    finetune_language_layers=True,
)

# Train with any TRL trainer (SFT, DPO, GRPO)
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    # ... remaining arguments (dataset, training config) elided
)
trainer.train()

Implementation Suggestions

Based on our analysis of the Unsloth codebase, here's what we believe is needed:

1. Registry entry — new _idefics.py:

class Idefics3VLMeta(ModelMeta):
    is_multimodal = True
    model_type = "idefics3"
    architectures = ["Idefics3ForConditionalGeneration"]
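As a rough illustration of what such a registry entry would key on, the sketch below checks a parsed `config.json` dictionary for the three fields named in this issue. The helper name `looks_like_idefics3` is hypothetical and not part of Unsloth; how Unsloth actually dispatches on configs may differ.

```python
def looks_like_idefics3(config: dict) -> bool:
    """Return True if a parsed config.json describes an Idefics3-style VLM.

    Hypothetical helper for illustration; it only inspects the fields
    mentioned in this issue (model_type, architectures, nested configs).
    """
    if config.get("model_type") != "idefics3":
        return False
    # A VLM config is expected to nest both a vision and a text sub-config.
    has_sub_configs = isinstance(config.get("vision_config"), dict) and isinstance(
        config.get("text_config"), dict
    )
    architectures = config.get("architectures") or []
    return has_sub_configs and "Idefics3ForConditionalGeneration" in architectures


example = {
    "model_type": "idefics3",
    "architectures": ["Idefics3ForConditionalGeneration"],
    "vision_config": {},
    "text_config": {},
}
print(looks_like_idefics3(example))
```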

2. Architecture patches — new idefics.py in models:

  • Attention optimizations for the Llama-based text model (can likely reuse existing Llama patches)
  • Optional vision encoder patches (SigLIP attention)
  • Pixel Shuffle connector handling

3. Support list updates:

  • Add "idefics3" to SUPPORTED_ARCHITECTURES in _utils.py
  • Add to VLLM_SUPPORTED_VLM in vision.py

4. Chat template — add Idefics3 template to chat_templates.py

5. LoRA target modules:

# Language model (Llama-based)
"q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"

# Vision encoder (SigLIP)
"vision_model.encoder.layers.*.self_attn.{q,k,v,out}_proj"
"vision_model.encoder.layers.*.mlp.{fc1,fc2}"

# Connector
"image_connector.{proj_in,proj_out}", "image_connector.simple_mlp.{fc1,fc2}"
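The brace and wildcard patterns above are shorthand rather than something a LoRA library would accept directly. One way to turn them into concrete matches is sketched below using only the standard library. The helpers `expand_braces` and `is_lora_target` are hypothetical, and the module names in the example are assumed for illustration and should be checked against `model.named_modules()` on the real checkpoint.

```python
import re
from fnmatch import fnmatchcase

_BRACE = re.compile(r"\{([^{}]*)\}")


def expand_braces(pattern):
    """Expand shell-style {a,b} alternatives into a list of plain patterns."""
    match = _BRACE.search(pattern)
    if match is None:
        return [pattern]
    head, tail = pattern[: match.start()], pattern[match.end():]
    expanded = []
    for option in match.group(1).split(","):
        # Recurse so patterns with several brace groups are fully expanded.
        expanded.extend(expand_braces(head + option.strip() + tail))
    return expanded


def is_lora_target(module_name, patterns):
    """True if module_name ends with any pattern ('*' matches any characters)."""
    return any(fnmatchcase(module_name, "*" + p) for p in patterns)


vision_mlp = expand_braces("vision_model.encoder.layers.*.mlp.{fc1,fc2}")
print(vision_mlp)
print(is_lora_target("model.vision_model.encoder.layers.3.mlp.fc1", vision_mlp))
```

Expanding the shorthand up front keeps the final target list explicit, which makes it easier to confirm that connector or vision layers are included only when intended.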

Architectural Similarity to Existing Models

| Feature | Idefics3 | Similar Supported Model |
| --- | --- | --- |
| Language backbone | Llama 3-based | Llama 3.2 Vision |
| Vision encoder | SigLIP | LLaVA |
| Attention type | Standard multi-head | LLaVA / Llama |
| Connector type | Pixel Shuffle | Unique (but simple linear projections) |

Given the overlap, we estimate this could leverage much of the existing Llama + SigLIP optimization code.

Models That Would Benefit

  • ibm-granite/granite-docling-258M (document understanding)
  • HuggingFaceM4/Idefics3-8B-Llama3 (general VLM)
  • Any future Idefics3-based models

Happy to help with implementation or testing if useful!