kornia/kornia

[Feature]: Qwen 3 VL model

Open

#3622 opened on Mar 16, 2026

Labels: help wanted, triage

Description

🚀 Feature Description

Target: Integrate Qwen3-VL (QwenLM/Qwen3-VL) as a first-class Vision-Language Model (VLM) in the Kornia model zoo, following Kornia's lightweight-AI-model initiative.

Scope: Dense models (2B, 4B, 8B)

📂 Feature Category

VLM/VLA Models (Vision Language Models/Agents) - Priority

💡 Motivation

Vision-language models are rapidly becoming a core component of modern computer vision pipelines.

However, current implementations (such as in Hugging Face Transformers) are designed mainly for inference pipelines, not differentiable vision research.

Integrating Qwen3-VL into Kornia would enable:

  • Differentiable image preprocessing with Kornia
  • Multimodal research pipelines
  • Vision-language model experimentation
  • Integration with Kornia augmentation and geometry modules

💭 Proposed Solution

PR-01 — Project Scaffolding & Dependency Declaration

Branch: feat/qwen3vl-scaffolding
Depends on: main

Goal

Establish the package structure, optional-dependency declarations, and CI configuration that all subsequent PRs will build upon. No model code yet — just the skeleton.

Files to Create / Modify

kornia/models/vlm/
    __init__.py           # exports Qwen3VL symbols
    qwen3vl/
        __init__.py
        _version.py       # "2B", "4B", "8B", "32B"

kornia/models/__init__.py  # add VLM sub-package to public API
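
One way the optional-dependency declaration could surface at import time is a small guard that fails with an actionable message. This is only a sketch: the helper name `require_optional` and the `kornia[vlm]` extra are hypothetical and not part of Kornia today.

```python
import importlib
import importlib.util
from types import ModuleType


def require_optional(module_name: str, extra: str = "vlm") -> ModuleType:
    """Import a top-level optional dependency or raise a helpful ImportError.

    `extra` is a hypothetical extras name used only to build the hint text.
    """
    if importlib.util.find_spec(module_name) is None:
        raise ImportError(
            f"'{module_name}' is required for the Qwen3-VL integration. "
            f"Install it with: pip install 'kornia[{extra}]'"
        )
    return importlib.import_module(module_name)
```

Calling the guard inside the functions that need the dependency, rather than at module import, would keep `import kornia` working for users who never touch the VLM sub-package.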

PR-02 — Vision Encoder Wrapper (ViT + DeepStack)

Branch: feat/qwen3vl-vision-encoder
Depends on: PR-01

Goal

Expose Qwen3-VL's ViT backbone as a standalone Kornia module so that users can extract multi-level visual features independently of the full LLM, enabling embedding-based use cases.

Files to Create

kornia/models/vlm/qwen3vl/
    vision_encoder.py     # Qwen3VLVisionEncoder
    deepstack.py          # DeepStackFusion module

PR-03 — Image Preprocessor

Branch: feat/qwen3vl-image-preprocessor
Depends on: PR-01

Goal

Implement Kornia-native image preprocessing that matches Qwen3-VL's expected input format: dynamic resolution tiling, normalization, and image_grid_thw metadata generation — without requiring qwen-vl-utils at runtime (it becomes an optional fallback).

Files to Create

kornia/models/vlm/qwen3vl/
    preprocessor.py       # Qwen3VLImagePreprocessor
    constants.py          # MIN_PIXELS, MAX_PIXELS, MEAN, STD, PATCH_SIZE
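
The dynamic-resolution step and the image_grid_thw metadata can be sketched in plain Python. The resize logic below is modelled on the `smart_resize` helper used for Qwen2-VL; the constant values (patch size, merge size, pixel bounds) are placeholder assumptions and should be checked against the released Qwen3-VL preprocessor config.

```python
import math
from typing import Tuple

# Assumed placeholder values; verify against the official Qwen3-VL config.
PATCH_SIZE = 16
MERGE_SIZE = 2
MIN_PIXELS = 256 * 256
MAX_PIXELS = 1024 * 1024


def smart_resize(
    height: int,
    width: int,
    factor: int = PATCH_SIZE * MERGE_SIZE,
    min_pixels: int = MIN_PIXELS,
    max_pixels: int = MAX_PIXELS,
) -> Tuple[int, int]:
    """Pick a target size divisible by `factor` within the pixel budget."""
    if height <= 0 or width <= 0:
        raise ValueError("height and width must be positive")
    # Snap each side to the nearest multiple of `factor`.
    h_bar = max(factor, round(height / factor) * factor)
    w_bar = max(factor, round(width / factor) * factor)
    if h_bar * w_bar > max_pixels:
        # Too large: shrink both sides by the same ratio, rounding down.
        beta = math.sqrt(height * width / max_pixels)
        h_bar = max(factor, math.floor(height / beta / factor) * factor)
        w_bar = max(factor, math.floor(width / beta / factor) * factor)
    elif h_bar * w_bar < min_pixels:
        # Too small: grow both sides by the same ratio, rounding up.
        beta = math.sqrt(min_pixels / (height * width))
        h_bar = math.ceil(height * beta / factor) * factor
        w_bar = math.ceil(width * beta / factor) * factor
    return h_bar, w_bar


def image_grid_thw(height: int, width: int, patch_size: int = PATCH_SIZE) -> Tuple[int, int, int]:
    """Return (t, h, w) patch counts for one resized still image (t = 1)."""
    return (1, height // patch_size, width // patch_size)
```

The actual resize and normalization would then be done with Kornia tensor ops on the size returned here, which is what keeps the pixel path differentiable.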

PR-04 — Qwen3VLModel Core Class (Full Forward Pass)

Branch: feat/qwen3vl-core-model
Depends on: PR-02, PR-03

Goal

Implement the primary user-facing model class that exposes a clean Kornia-idiomatic API for multimodal text generation.

Files to Create

kornia/models/vlm/qwen3vl/
    model.py              # Qwen3VLModel
    types.py              # Qwen3VLInput, Qwen3VLOutput dataclasses
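
As a rough illustration of what `types.py` might hold, the dataclasses below pair each image with its grid metadata. All field names are hypothetical; the issue does not specify them, and images are typed as `Any` so the sketch does not depend on a tensor library.

```python
from dataclasses import dataclass, field
from typing import Any, List, Tuple


@dataclass
class Qwen3VLInput:
    """Hypothetical container for one multimodal request."""

    prompt: str
    images: List[Any] = field(default_factory=list)
    # One (t, h, w) patch-grid entry per image.
    image_grid_thw: List[Tuple[int, int, int]] = field(default_factory=list)

    def __post_init__(self) -> None:
        if len(self.images) != len(self.image_grid_thw):
            raise ValueError("each image needs exactly one image_grid_thw entry")


@dataclass
class Qwen3VLOutput:
    """Hypothetical container for one generation result."""

    text: str
    token_ids: List[int] = field(default_factory=list)
```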

PR-05 — Builder & Weight-Loading Utilities

Branch: feat/qwen3vl-builder
Depends on: PR-04

Goal

Implement the Qwen3VLBuilder following the established Kornia Builder pattern (used by RTDETRDetectorBuilder, DexiNedBuilder, etc.), with support for all model sizes, caching, and optional local weight directories.

Files to Create / Modify

kornia/models/vlm/qwen3vl/
    builder.py            # Qwen3VLBuilder, Qwen3VLModelSize
kornia/models/vlm/__init__.py  # re-export builder
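
A minimal sketch of the size enum and checkpoint resolution is shown below. The method name `resolve_checkpoint` is hypothetical, and the Hub repository naming scheme (`Qwen/Qwen3-VL-<size>-Instruct`) is an assumption that should be verified before use; no weights are downloaded here.

```python
from enum import Enum
from pathlib import Path
from typing import Optional, Union


class Qwen3VLModelSize(str, Enum):
    """Dense model sizes listed for _version.py in PR-01."""

    B2 = "2B"
    B4 = "4B"
    B8 = "8B"
    B32 = "32B"


class Qwen3VLBuilder:
    """Sketch only: resolves where weights would be loaded from."""

    @staticmethod
    def resolve_checkpoint(
        size: Union[str, Qwen3VLModelSize],
        local_dir: Optional[Union[str, Path]] = None,
    ) -> str:
        # Raises ValueError for sizes outside the enum.
        size = Qwen3VLModelSize(size)
        name = f"Qwen3-VL-{size.value}-Instruct"  # assumed naming scheme
        if local_dir is not None:
            # Prefer a user-supplied weight directory over the Hub.
            return str(Path(local_dir) / name)
        return f"Qwen/{name}"
```

A `build()` static method in the style of the existing Kornia builders could call this resolver first and then hand the result to the weight loader.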

PR-06 — Structured Output Parsers

Branch: feat/qwen3vl-output-parsers
Depends on: PR-04

Goal

Implement parser utilities that convert Qwen3-VL's raw text outputs into structured Python objects (bounding boxes as kornia.geometry.bbox, OCR results, grounding JSON) — enabling tight integration with Kornia's geometric primitives.

Files to Create

kornia/models/vlm/qwen3vl/
    parsers.py            # BBoxParser, OCRParser, GroundingParser
    prompts.py            # standard task prompt templates
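
The bounding-box parser could look roughly like the sketch below. It assumes the model answers with a JSON array of objects carrying a `bbox_2d` key and a `label` key, with coordinates on a 0 to 1000 relative scale; both the output format and the scale are assumptions to confirm against real Qwen3-VL outputs. `ParsedBox` and `parse_bboxes` are hypothetical names.

```python
import json
from dataclasses import dataclass
from typing import List


@dataclass
class ParsedBox:
    label: str
    xyxy: List[float]  # pixel coordinates: x1, y1, x2, y2


def parse_bboxes(
    raw_text: str,
    image_width: int,
    image_height: int,
    coord_range: float = 1000.0,  # assumed relative coordinate scale
) -> List[ParsedBox]:
    """Extract boxes from raw model text and rescale them to pixels."""
    # Take the outermost [...] span so surrounding prose or fences are ignored.
    start, end = raw_text.find("["), raw_text.rfind("]")
    if start == -1 or end <= start:
        return []
    try:
        items = json.loads(raw_text[start : end + 1])
    except json.JSONDecodeError:
        return []
    boxes: List[ParsedBox] = []
    for item in items:
        if not isinstance(item, dict):
            continue
        coords = item.get("bbox_2d")
        if not (isinstance(coords, list) and len(coords) == 4):
            continue
        if not all(isinstance(c, (int, float)) for c in coords):
            continue
        x1, y1, x2, y2 = coords
        boxes.append(
            ParsedBox(
                label=str(item.get("label", "")),
                xyxy=[
                    x1 * image_width / coord_range,
                    y1 * image_height / coord_range,
                    x2 * image_width / coord_range,
                    y2 * image_height / coord_range,
                ],
            )
        )
    return boxes
```

The `xyxy` lists can be stacked into a tensor and passed on to the box utilities in `kornia.geometry` in the real implementation; malformed output yields an empty list rather than an exception.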

PR-07 — Video Input Support

Branch: feat/qwen3vl-video
Depends on: PR-04, PR-03

Goal

Extend the model and preprocessor to handle video inputs: frame sampling, video_pad token injection, image_grid_thw with temporal dimension > 1, and Text-Timestamp Alignment for temporal grounding queries.

Files to Create / Modify

kornia/models/vlm/qwen3vl/
    video_preprocessor.py  # Qwen3VLVideoPreprocessor
    model.py               # extend Qwen3VLInput, add video_forward()
    types.py               # add VideoGroundingOutput
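
Frame sampling for video inputs can be sketched as uniform index selection with per-frame timestamps, which the temporal grounding step would need. The function name, the default values, and the temporal patch size of 2 are assumptions for illustration.

```python
import math
from typing import List, Tuple


def sample_frame_indices(
    num_frames: int,
    video_fps: float,
    target_fps: float = 2.0,
    max_frames: int = 64,
    temporal_patch: int = 2,  # assumed: frames are grouped in pairs
) -> Tuple[List[int], List[float]]:
    """Return uniformly spaced frame indices and their timestamps in seconds."""
    if num_frames <= 0 or video_fps <= 0 or target_fps <= 0:
        raise ValueError("num_frames, video_fps and target_fps must be positive")
    if temporal_patch < 1 or max_frames < temporal_patch:
        raise ValueError("max_frames must be at least temporal_patch")
    # Frames wanted at the target rate, rounded up to a multiple of the patch.
    wanted = round(num_frames / video_fps * target_fps)
    wanted = max(temporal_patch, math.ceil(wanted / temporal_patch) * temporal_patch)
    # Largest multiple of the patch that still respects max_frames.
    cap = max_frames - max_frames % temporal_patch
    n = min(wanted, cap)
    if n == 1:
        indices = [0]
    else:
        step = (num_frames - 1) / (n - 1)
        indices = [round(i * step) for i in range(n)]
    timestamps = [idx / video_fps for idx in indices]
    return indices, timestamps
```

Under these assumptions the temporal entry of image_grid_thw would be `len(indices) // temporal_patch`, which is greater than 1 for any clip sampled to more than one patch group.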

PR-08 — Tests, Benchmarks & Documentation

Branch: feat/qwen3vl-docs-tests
Depends on: PR-01 through PR-07

Goal

Deliver a complete test suite, performance benchmarks, interactive notebooks, and API documentation that brings Qwen3-VL to production-readiness within Kornia.

🤝 Contribution Intent

  • I plan to submit a PR to implement this feature

This is not the final plan. It will probably be extended, since more detail still needs to be added for the main API model class, and the plan for text is still to be added.
