unslothai/unsloth

[Feature Request] Add Idefics3 architecture support (Granite Docling VLM)

Open

#4079 opened on Feb 19, 2026


Description

Feature Request: Idefics3 Architecture Support

Summary

Requesting native Unsloth support for the Idefics3 architecture, which would enable optimized fine-tuning of models like IBM Granite Docling VLM (258M params) — a high-performing document understanding model.

Why This Matters

Granite Docling VLM achieves 87.7 on DocVQA with only 258M parameters (vs. the original Idefics3-8B at 74.0). It's Apache 2.0 licensed and increasingly used for document conversion (PDFs, scans, slides → structured output). Unsloth support would make fine-tuning this model significantly faster and more memory-efficient, opening it up to consumer GPUs.

Architecture Analysis

Granite Docling VLM (and all Idefics3 models) uses Idefics3ForConditionalGeneration. Its components map closely to things Unsloth already supports:

| Component | Idefics3 / Granite Docling | Unsloth Status |
|---|---|---|
| Vision Encoder | SigLIP2-base-patch16-512 | SigLIP supported in other VLMs (LLaVA, etc.) |
| Language Model | Granite 165M (Llama 3-based) | Llama fully supported |
| Connector | Pixel Shuffle projector (4x spatial compression) | Not yet in Unsloth |
| Model Class | Idefics3ForConditionalGeneration | Not registered |
| Config type | model_type = "idefics3" with vision_config + text_config | Would be detected as VLM, but lacks patches |

The language model backbone is Llama-based, and the vision encoder is SigLIP — both already have Unsloth optimizations in other model families. The primary new component is the Pixel Shuffle connector that bridges vision→language.
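To make the connector concrete, here is a minimal pure-Python sketch of the pixel-shuffle (space-to-depth) step: each `scale x scale` block of neighbouring patch tokens is merged into one token whose feature vector is the concatenation of the originals. It assumes a square patch grid and `scale=2` (inferred from the "4x" figure above); the function name is illustrative, and the Hugging Face implementation uses tensor reshapes whose channel ordering may differ from this one.

```python
import math


def pixel_shuffle(tokens, scale=2):
    """Merge each scale x scale block of patch tokens into one wider token.

    tokens: list of feature vectors for a square patch grid, in row-major order.
    Returns len(tokens) / scale**2 vectors, each scale**2 times as wide.
    """
    side = math.isqrt(len(tokens))
    if side * side != len(tokens) or side % scale != 0:
        raise ValueError("expected a square grid whose side is divisible by scale")

    merged = []
    for block_row in range(0, side, scale):
        for block_col in range(0, side, scale):
            features = []
            # Concatenate the features of every patch inside this block.
            for dy in range(scale):
                for dx in range(scale):
                    features.extend(tokens[(block_row + dy) * side + (block_col + dx)])
            merged.append(features)
    return merged


# A 4x4 grid of 1-dimensional tokens becomes a 2x2 grid of 4-dimensional tokens.
grid = [[i] for i in range(16)]
shuffled = pixel_shuffle(grid, scale=2)
print(len(shuffled), len(shuffled[0]))
```

If the encoder name is read as 16-pixel patches on a 512-pixel input (an assumption), the grid is 32 x 32 = 1024 patches, which this step would reduce to 256 tokens before the linear projection into the language model's hidden size.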

Desired Scope

Full FastVisionModel support including:

  • SFT via SFTTrainer
  • DPO via DPOTrainer
  • GRPO / GSPO via GRPOTrainer
  • LoRA with selective layer training (finetune_vision_layers, finetune_language_layers, etc.)
  • 4-bit quantization via load_in_4bit
  • Unsloth gradient checkpointing (use_gradient_checkpointing="unsloth")
  • Fast inference via vLLM integration (fast_inference=True)

Ideal Usage

from unsloth import FastVisionModel
from trl import SFTTrainer

# Load Granite Docling VLM with Unsloth optimizations
model, tokenizer = FastVisionModel.from_pretrained(
    model_name="ibm-granite/granite-docling-258M",
    max_seq_length=2048,
    load_in_4bit=True,
    use_gradient_checkpointing="unsloth",
)

# Apply LoRA
model = FastVisionModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    target_modules="all-linear",
    finetune_vision_layers=False,
    finetune_language_layers=True,
)

# Train with any TRL trainer (SFT, DPO, GRPO)
trainer = SFTTrainer(model=model, tokenizer=tokenizer, ...)
trainer.train()

Implementation Suggestions

Based on our analysis of the Unsloth codebase, here's what we believe is needed:

1. Registry entry — new _idefics.py:

class Idefics3VLMeta(ModelMeta):
    is_multimodal = True
    model_type = "idefics3"
    architectures = ["Idefics3ForConditionalGeneration"]

2. Architecture patches — new idefics.py in models:

  • Attention optimizations for the Llama-based text model (can likely reuse existing Llama patches)
  • Optional vision encoder patches (SigLIP attention)
  • Pixel Shuffle connector handling

3. Support list updates:

  • Add "idefics3" to SUPPORTED_ARCHITECTURES in _utils.py
  • Add to VLLM_SUPPORTED_VLM in vision.py
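The gap this step closes can be sketched as follows: a config with both `vision_config` and `text_config` looks like a VLM, but without an entry in the supported list no patches apply. This is only an illustration; the set contents and the `classify_config` helper are hypothetical stand-ins, not Unsloth's actual data or API.

```python
# Hypothetical stand-in for the supported-architecture list; the real contents may differ.
SUPPORTED_ARCHITECTURES = {"llama", "mllama", "qwen2_vl"}


def classify_config(config, supported=SUPPORTED_ARCHITECTURES):
    """Return a coarse label for a model config given as a plain dict."""
    looks_like_vlm = "vision_config" in config and "text_config" in config
    if config.get("model_type") in supported:
        return "supported VLM" if looks_like_vlm else "supported text model"
    return "unpatched VLM" if looks_like_vlm else "unsupported"


idefics3_config = {"model_type": "idefics3", "vision_config": {}, "text_config": {}}

# Before the change: detected as a VLM, but no patches are available.
print(classify_config(idefics3_config))

# After adding "idefics3" to the list, the same config is accepted.
print(classify_config(idefics3_config, SUPPORTED_ARCHITECTURES | {"idefics3"}))
```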

4. Chat template — add Idefics3 template to chat_templates.py

5. LoRA target modules:

# Language model (Llama-based)
"q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"

# Vision encoder (SigLIP)
"vision_model.encoder.layers.*.self_attn.{q,k,v,out}_proj"
"vision_model.encoder.layers.*.mlp.{fc1,fc2}"

# Connector
"image_connector.{proj_in,proj_out}", "image_connector.simple_mlp.{fc1,fc2}"
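The `*` and `{a,b}` notation above is shorthand rather than literal syntax. PEFT's `LoraConfig.target_modules` accepts either a list of module-name suffixes or a single regex string, so one way to use these patterns is to expand them into a regex. The sketch below does that with the standard library only; `shorthand_to_regex` is a hypothetical helper, and the module paths should be checked against `model.named_modules()` on the loaded checkpoint, since the names in this issue have not been verified against the Transformers implementation.

```python
import re


def shorthand_to_regex(pattern):
    """Expand 'layers.*.{q,k}_proj' style shorthand into a regex string.

    '*' is treated as a numeric layer index and '{a,b}' as alternatives.
    Any parent prefix (for example 'model.') is allowed before the pattern.
    """
    parts = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "{":
            end = pattern.index("}", i)
            alternatives = pattern[i + 1:end].split(",")
            parts.append("(?:" + "|".join(re.escape(a) for a in alternatives) + ")")
            i = end + 1
        elif ch == "*":
            parts.append(r"\d+")
            i += 1
        else:
            parts.append(re.escape(ch))
            i += 1
    return r"(?:.*\.)?" + "".join(parts)


vision_attn = shorthand_to_regex("vision_model.encoder.layers.*.self_attn.{q,k,v,out}_proj")

# The regex matches vision attention projections but not language-model ones.
print(bool(re.fullmatch(vision_attn, "model.vision_model.encoder.layers.11.self_attn.out_proj")))
print(bool(re.fullmatch(vision_attn, "model.text_model.layers.0.self_attn.q_proj")))
```

Several expanded patterns can be joined with `|` into one regex string if a single `target_modules` value is needed.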

Architectural Similarity to Existing Models

| Feature | Idefics3 | Similar Supported Model |
|---|---|---|
| Language backbone | Llama 3-based | Llama 3.2 Vision |
| Vision encoder | SigLIP | LLaVA |
| Attention type | Standard multi-head | LLaVA / Llama |
| Connector type | Pixel Shuffle | Unique (but simple linear projections) |

Given the overlap, we estimate this could leverage much of the existing Llama + SigLIP optimization code.

Models That Would Benefit

  • ibm-granite/granite-docling-258M (document understanding)
  • HuggingFaceM4/Idefics3-8B-Llama3 (general VLM)
  • Any future Idefics3-based models

Happy to help with implementation or testing if useful!
