[Feature Request] Add Idefics3 architecture support (Granite Docling VLM) · unslothai/unsloth#4079

help wanted

Description

Feature Request: Idefics3 Architecture Support

Summary

Requesting native Unsloth support for the Idefics3 architecture, which would enable optimized fine-tuning of models like IBM Granite Docling VLM (258M params) — a high-performing document understanding model.

Why This Matters

Granite Docling VLM achieves 87.7 on DocVQA with only 258M parameters (vs. the original Idefics3-8B at 74.0). It's Apache 2.0 licensed and increasingly used for document conversion (PDFs, scans, slides → structured output). Unsloth support would make fine-tuning this model significantly faster and more memory-efficient, opening it up to consumer GPUs.

Architecture Analysis

Granite Docling VLM (and all Idefics3 models) uses Idefics3ForConditionalGeneration. Its components map closely to things Unsloth already supports:

| Component | Idefics3 / Granite Docling | Unsloth Status |
|---|---|---|
| Vision Encoder | SigLIP2-base-patch16-512 | SigLIP supported in other VLMs (LLaVA, etc.) |
| Language Model | Granite 165M (Llama 3-based) | Llama fully supported |
| Connector | Pixel Shuffle projector (4x spatial compression) | Not yet in Unsloth |
| Model Class | `Idefics3ForConditionalGeneration` | Not registered |
| Config type | `model_type = "idefics3"` with `vision_config` + `text_config` | Would be detected as VLM, but lacks patches |

The language model backbone is Llama-based, and the vision encoder is SigLIP — both already have Unsloth optimizations in other model families. The primary new component is the Pixel Shuffle connector that bridges vision→language.
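To make the connector's effect on sequence length concrete, here is a minimal shape-arithmetic sketch. It assumes a 512-pixel input and 16-pixel patches (read off the encoder name), a hidden size of 768 for a base-sized SigLIP encoder, and a scale factor of 2, which is one reading of "4x spatial compression" (4x fewer visual tokens). The function name `pixel_shuffle_shape` is hypothetical, and the real scale factor should be read from the model's config rather than taken from this example.

```python
def pixel_shuffle_shape(image_size, patch_size, hidden_size, scale_factor):
    """Return (tokens_before, tokens_after, channels_after) for a pixel-shuffle connector.

    Pixel shuffle folds each scale_factor x scale_factor block of neighbouring
    patch embeddings into one token, so the token count shrinks by
    scale_factor**2 while the channel width grows by the same factor.
    """
    side = image_size // patch_size  # patches per image side
    if side % scale_factor != 0:
        raise ValueError("patch grid side must be divisible by scale_factor")
    tokens_before = side * side
    tokens_after = tokens_before // (scale_factor ** 2)
    channels_after = hidden_size * scale_factor ** 2
    return tokens_before, tokens_after, channels_after


# Assumed values: 512 px image, 16 px patches, hidden size 768, scale factor 2.
before, after, channels = pixel_shuffle_shape(512, 16, 768, 2)
print(before, after, channels)  # 1024 256 3072
```

The widened channel dimension is what the connector's linear projection then maps down to the language model's hidden size, which is why the connector is the only piece without an existing counterpart.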

Desired Scope

Full FastVisionModel support including:

SFT via SFTTrainer
DPO via DPOTrainer
GRPO / GSPO via GRPOTrainer
LoRA with selective layer training (finetune_vision_layers, finetune_language_layers, etc.)
4-bit quantization via load_in_4bit
Unsloth gradient checkpointing (use_gradient_checkpointing="unsloth")
Fast inference via vLLM integration (fast_inference=True)

Ideal Usage

from unsloth import FastVisionModel
from trl import SFTTrainer

# Load Granite Docling VLM with Unsloth optimizations
model, tokenizer = FastVisionModel.from_pretrained(
    model_name="ibm-granite/granite-docling-258M",
    max_seq_length=2048,
    load_in_4bit=True,
    use_gradient_checkpointing="unsloth",
)

# Apply LoRA
model = FastVisionModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    target_modules="all-linear",
    finetune_vision_layers=False,
    finetune_language_layers=True,
)

# Train with any TRL trainer (SFT, DPO, GRPO)
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    # ... (remaining arguments)
)
trainer.train()

Implementation Suggestions

Based on our analysis of the Unsloth codebase, here's what we believe is needed:

1. Registry entry — new _idefics.py:

class Idefics3VLMeta(ModelMeta):
    is_multimodal = True
    model_type = "idefics3"
    architectures = ["Idefics3ForConditionalGeneration"]

2. Architecture patches — new idefics.py in models:

Attention optimizations for the Llama-based text model (can likely reuse existing Llama patches)
Optional vision encoder patches (SigLIP attention)
Pixel Shuffle connector handling

3. Support list updates:

Add "idefics3" to SUPPORTED_ARCHITECTURES in _utils.py
Add to VLLM_SUPPORTED_VLM in vision.py
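As a rough illustration of what the support-list change buys, the sketch below classifies a config dictionary by `model_type` and by the presence of `vision_config` and `text_config`. The set contents and the helper `classify_config` are hypothetical stand-ins written for this example; the actual `SUPPORTED_ARCHITECTURES` container and detection logic in Unsloth may be structured differently.

```python
# Hypothetical stand-in for the support list; not Unsloth's real data structure.
SUPPORTED_ARCHITECTURES = {"llama", "idefics3"}


def classify_config(config: dict) -> str:
    """Classify a model config as 'unsupported', 'vlm', or 'text'."""
    model_type = config.get("model_type")
    if model_type not in SUPPORTED_ARCHITECTURES:
        return "unsupported"
    # A config carrying both sub-configs is treated as a vision-language model.
    is_vlm = "vision_config" in config and "text_config" in config
    return "vlm" if is_vlm else "text"


idefics3_config = {"model_type": "idefics3", "vision_config": {}, "text_config": {}}
print(classify_config(idefics3_config))            # vlm
print(classify_config({"model_type": "llama"}))    # text
print(classify_config({"model_type": "unknown"}))  # unsupported
```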

4. Chat template — add Idefics3 template to chat_templates.py

5. LoRA target modules:

# Language model (Llama-based)
"q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"

# Vision encoder (SigLIP)
"vision_model.encoder.layers.*.self_attn.{q,k,v,out}_proj"
"vision_model.encoder.layers.*.mlp.{fc1,fc2}"

# Connector
"connector.modality_projection.proj"  # single linear layer applied after pixel shuffle (name as in Transformers' Idefics3Connector)
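The brace shorthand above is not something a LoRA config accepts directly. One option, sketched here, is to build a single regular expression, since PEFT treats a string `target_modules` as a regex matched against full module names. The helper `build_target_regex` and the sample module paths are hypothetical; the language and vision leaf names are the ones listed above, and connector names are left out because they should be checked against `model.named_modules()` on the loaded checkpoint.

```python
import re

# Leaf module names taken from the lists above.
LANGUAGE_LEAVES = ["q_proj", "k_proj", "v_proj", "o_proj",
                   "gate_proj", "up_proj", "down_proj"]
VISION_ATTN_LEAVES = ["q_proj", "k_proj", "v_proj", "out_proj"]
VISION_MLP_LEAVES = ["fc1", "fc2"]


def build_target_regex(include_vision: bool) -> str:
    """Build one regex selecting LoRA target modules by full module name."""
    # Language projections: any path that does NOT pass through vision_model.
    alternatives = [
        r"(?!.*vision_model).*\.(?:" + "|".join(LANGUAGE_LEAVES) + ")"
    ]
    if include_vision:
        prefix = r".*vision_model\.encoder\.layers\.\d+\."
        alternatives.append(
            prefix + r"self_attn\.(?:" + "|".join(VISION_ATTN_LEAVES) + ")")
        alternatives.append(
            prefix + r"mlp\.(?:" + "|".join(VISION_MLP_LEAVES) + ")")
    return "(?:" + "|".join(alternatives) + ")"


# Hypothetical module paths, for illustration only.
text_name = "model.text_model.layers.0.self_attn.q_proj"
vision_name = "model.vision_model.encoder.layers.3.self_attn.out_proj"

language_only = build_target_regex(include_vision=False)
print(bool(re.fullmatch(language_only, text_name)))    # True
print(bool(re.fullmatch(language_only, vision_name)))  # False
```

Keeping the vision alternatives behind a flag mirrors the `finetune_vision_layers` switch in the usage example: the same builder covers both the language-only and the full-model case.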

Architectural Similarity to Existing Models

| Feature | Idefics3 | Similar Supported Model |
|---|---|---|
| Language backbone | Llama 3-based | Llama 3.2 Vision |
| Vision encoder | SigLIP | LLaVA |
| Attention type | Standard multi-head | LLaVA / Llama |
| Connector type | Pixel Shuffle | Unique (but simple linear projections) |

Given the overlap, we estimate this could leverage much of the existing Llama + SigLIP optimization code.

Models That Would Benefit

ibm-granite/granite-docling-258M (document understanding)
HuggingFaceM4/Idefics3-8B-Llama3 (general VLM)
Any future Idefics3-based models

Happy to help with implementation or testing if useful!

Contributor guide

Research direction: Implement Idefics3 support by creating a new model file `idefics.py` with attention and connector patches, register the architecture in `_utils.py`, add LoRA target modules, and update chat templates. Reuse existing Llama and SigLIP optimizations where possible.
Tech stack: python
Domain: AI, machine learning
Issue type: Feature
Difficulty: 4
Estimated time: 3-5 days
Activity status: Active
Clarity: Clear
Prerequisites: Python, PyTorch, HuggingFace Transformers
Newbie friendliness: 30