🚀 Feature Description
Target: Integrate Qwen3-VL (QwenLM/Qwen3-VL) as a first-class Vision-Language Model (VLM) in the Kornia model zoo, following Kornia's lightweight-AI-model initiative.
Scope: Dense models (2B, 4B, 8B, 32B)
📂 Feature Category
VLM/VLA Models (Vision Language Models/Agents) - Priority
💡 Motivation
Vision-language models are rapidly becoming a core component of modern computer vision pipelines.
However, current implementations (such as in Hugging Face Transformers) are designed mainly for inference pipelines, not differentiable vision research.
Integrating Qwen3-VL into Kornia would enable:
- Differentiable image preprocessing with Kornia
- Multimodal research pipelines
- Vision-language model experimentation
- Integration with Kornia augmentation and geometry modules
💭 Proposed Solution
PR-01 — Project Scaffolding & Dependency Declaration
Branch: feat/qwen3vl-scaffolding
Depends on: main
Goal
Establish the package structure, optional-dependency declarations, and CI configuration that all subsequent PRs will build upon. No model code yet — just the skeleton.
Files to Create / Modify
kornia/models/vlm/
    __init__.py              # exports Qwen3VL symbols
    qwen3vl/
        __init__.py
        _version.py          # "2B", "4B", "8B", "32B"
kornia/models/__init__.py    # add VLM sub-package to public API
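One way the optional-dependency handling in this PR could look is a small guard that raises an actionable error when a backing package is missing. This is a minimal sketch, not Kornia's actual mechanism: the helper name `require_optional` and the `vlm` extra are hypothetical names chosen for illustration.

```python
import importlib
import importlib.util
from types import ModuleType


def require_optional(module_name: str, extra: str = "vlm") -> ModuleType:
    """Import an optional dependency, or fail with an install hint.

    Both this helper and the ``vlm`` extra are hypothetical; they are not
    part of Kornia today.
    """
    # find_spec returns None when a top-level package is not installed.
    if importlib.util.find_spec(module_name) is None:
        raise ImportError(
            f"'{module_name}' is required for this feature. "
            f"Install it with: pip install 'kornia[{extra}]'"
        )
    return importlib.import_module(module_name)
```

Calling the guard inside the functions that need the dependency, rather than at module import time, would keep `import kornia` working for users who never touch the VLM sub-package.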
PR-02 — Vision Encoder Wrapper (ViT + DeepStack)
Branch: feat/qwen3vl-vision-encoder
Depends on: PR-01
Goal
Expose Qwen3-VL's ViT backbone as a standalone Kornia module so that users can extract multi-level visual features independently of the full LLM, enabling embedding-based use cases.
Files to Create
kornia/models/vlm/qwen3vl/
    vision_encoder.py        # Qwen3VLVisionEncoder
    deepstack.py             # DeepStackFusion module
PR-03 — Image Preprocessor
Branch: feat/qwen3vl-image-preprocessor
Depends on: PR-01
Goal
Implement Kornia-native image preprocessing that matches Qwen3-VL's expected input format: dynamic resolution tiling, normalization, and image_grid_thw metadata generation — without requiring qwen-vl-utils at runtime (it becomes an optional fallback).
Files to Create
kornia/models/vlm/qwen3vl/
    preprocessor.py          # Qwen3VLImagePreprocessor
    constants.py             # MIN_PIXELS, MAX_PIXELS, MEAN, STD, PATCH_SIZE
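The dynamic-resolution step and the `image_grid_thw` metadata can be sketched in plain Python. The sketch below assumes a patch size of 16, a spatial merge size of 2, and placeholder pixel budgets; the function names `smart_resize` and `image_grid_thw_for` are illustrative, and all values would need to be checked against the reference Qwen3-VL processor before being placed in `constants.py`.

```python
import math
from typing import Tuple

# Placeholder values for illustration only (assumptions, not verified constants).
PATCH_SIZE = 16
MERGE_SIZE = 2
MIN_PIXELS = 256 * 32 * 32
MAX_PIXELS = 1280 * 32 * 32


def smart_resize(
    height: int,
    width: int,
    factor: int = PATCH_SIZE * MERGE_SIZE,
    min_pixels: int = MIN_PIXELS,
    max_pixels: int = MAX_PIXELS,
) -> Tuple[int, int]:
    """Pick a target size that is a multiple of `factor` and fits the pixel budget."""
    if height <= 0 or width <= 0:
        raise ValueError("height and width must be positive")
    # Snap each side to the nearest multiple of `factor` (at least one block).
    h_bar = max(factor, round(height / factor) * factor)
    w_bar = max(factor, round(width / factor) * factor)
    if h_bar * w_bar > max_pixels:
        # Too large: shrink both sides by the same ratio, rounding down.
        beta = math.sqrt((height * width) / max_pixels)
        h_bar = max(factor, math.floor(height / beta / factor) * factor)
        w_bar = max(factor, math.floor(width / beta / factor) * factor)
    elif h_bar * w_bar < min_pixels:
        # Too small: grow both sides by the same ratio, rounding up.
        beta = math.sqrt(min_pixels / (height * width))
        h_bar = math.ceil(height * beta / factor) * factor
        w_bar = math.ceil(width * beta / factor) * factor
    return h_bar, w_bar


def image_grid_thw_for(height: int, width: int) -> Tuple[int, int, int]:
    """Return (T, H, W) in patch units for a single still image (T = 1)."""
    h_bar, w_bar = smart_resize(height, width)
    return 1, h_bar // PATCH_SIZE, w_bar // PATCH_SIZE
```

In the actual preprocessor, the resize itself would presumably be done with Kornia's differentiable resize operators, with this size computation deciding the target shape.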
PR-04 — Qwen3VLModel Core Class (Full Forward Pass)
Branch: feat/qwen3vl-core-model
Depends on: PR-02, PR-03
Goal
Implement the primary user-facing model class that exposes a clean Kornia-idiomatic API for multimodal text generation.
Files to Create
kornia/models/vlm/qwen3vl/
    model.py                 # Qwen3VLModel
    types.py                 # Qwen3VLInput, Qwen3VLOutput dataclasses
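As a starting point for discussion, the two dataclasses in `types.py` might look like the sketch below. The field names are assumptions modelled on common vision-language model inputs, not a settled API, and tensor fields are typed as `Any` so the sketch runs without PyTorch installed.

```python
from dataclasses import dataclass
from typing import Any, List, Optional


@dataclass
class Qwen3VLInput:
    """Hypothetical input container; field names are illustrative."""

    input_ids: Any                      # token ids, expected shape (B, L)
    attention_mask: Any                 # attention mask, expected shape (B, L)
    pixel_values: Optional[Any] = None  # flattened image patches, if any
    image_grid_thw: Optional[Any] = None  # (T, H, W) patch grid per image

    def __post_init__(self) -> None:
        # Visual inputs only make sense together: patches need their grid layout.
        if (self.pixel_values is None) != (self.image_grid_thw is None):
            raise ValueError(
                "pixel_values and image_grid_thw must be provided together"
            )


@dataclass
class Qwen3VLOutput:
    """Hypothetical output container; field names are illustrative."""

    text: List[str]                     # decoded generations, one per sample
    token_ids: Optional[Any] = None     # raw generated token ids, if kept
```

Validating the pairing in `__post_init__` would surface a malformed input at construction time rather than deep inside the forward pass.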
PR-05 — Builder & Weight-Loading Utilities
Branch: feat/qwen3vl-builder
Depends on: PR-04
Goal
Implement the Qwen3VLBuilder following the established Kornia Builder pattern (used by RTDETRDetectorBuilder, DexiNedBuilder, etc.), with support for all model sizes, caching, and optional local weight directories.
Files to Create / Modify
kornia/models/vlm/qwen3vl/
    builder.py                   # Qwen3VLBuilder, Qwen3VLModelSize
kornia/models/vlm/__init__.py    # re-export builder
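The size selection and local-weights handling could be sketched as follows. Only the weight-resolution step is shown; the method name `resolve_weights` is hypothetical, and the hub identifier pattern `Qwen/Qwen3-VL-<size>-Instruct` is an assumption that should be verified against the published checkpoints.

```python
from enum import Enum
from pathlib import Path
from typing import Optional, Union


class Qwen3VLModelSize(str, Enum):
    """Dense model sizes listed in this proposal."""

    B2 = "2B"
    B4 = "4B"
    B8 = "8B"
    B32 = "32B"


class Qwen3VLBuilder:
    """Sketch of the weight-resolution part of the builder (hypothetical API)."""

    @staticmethod
    def resolve_weights(
        size: Union[str, Qwen3VLModelSize],
        local_dir: Optional[Union[str, Path]] = None,
    ) -> str:
        # Enum lookup raises ValueError for sizes outside the supported set.
        size = Qwen3VLModelSize(size)
        if local_dir is not None:
            path = Path(local_dir)
            if not path.is_dir():
                raise FileNotFoundError(f"weight directory not found: {path}")
            return str(path)
        # Assumed hub naming pattern; verify before relying on it.
        return f"Qwen/Qwen3-VL-{size.value}-Instruct"
```

For example, `Qwen3VLBuilder.resolve_weights("4B")` returns the assumed hub identifier, while passing `local_dir` short-circuits to the local path.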
PR-06 — Structured Output Parsers
Branch: feat/qwen3vl-output-parsers
Depends on: PR-04
Goal
Implement parser utilities that convert Qwen3-VL's raw text outputs into structured Python objects (bounding boxes as kornia.geometry.bbox, OCR results, grounding JSON) — enabling tight integration with Kornia's geometric primitives.
Files to Create
kornia/models/vlm/qwen3vl/
    parsers.py               # BBoxParser, OCRParser, GroundingParser
    prompts.py               # standard task prompt templates
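A bounding-box parser along these lines might extract a JSON array from the generated text and rescale the coordinates to pixels. The sketch assumes the model emits items shaped like `{"bbox_2d": [x1, y1, x2, y2], "label": "..."}` with coordinates on a 0 to 1000 relative scale; both the output schema and the scale are assumptions to confirm against real model output, and `ParsedBox` and `parse_bboxes` are illustrative names.

```python
import json
import re
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ParsedBox:
    label: str
    xyxy: Tuple[float, float, float, float]  # pixel coordinates


# Greedy match from the first '[' to the last ']' in the generated text.
_JSON_ARRAY = re.compile(r"\[.*\]", re.DOTALL)


def parse_bboxes(
    text: str,
    image_width: int,
    image_height: int,
    coord_range: float = 1000.0,  # assumed relative coordinate scale
) -> List[ParsedBox]:
    """Parse grounding output into pixel-space boxes; returns [] on failure."""
    match = _JSON_ARRAY.search(text)
    if match is None:
        return []
    try:
        items = json.loads(match.group(0))
    except json.JSONDecodeError:
        return []
    boxes: List[ParsedBox] = []
    for item in items:
        if not isinstance(item, dict):
            continue
        coords = item.get("bbox_2d")
        if not (isinstance(coords, list) and len(coords) == 4):
            continue
        if not all(isinstance(c, (int, float)) for c in coords):
            continue
        x1, y1, x2, y2 = coords
        boxes.append(
            ParsedBox(
                label=str(item.get("label", "")),
                xyxy=(
                    x1 * image_width / coord_range,
                    y1 * image_height / coord_range,
                    x2 * image_width / coord_range,
                    y2 * image_height / coord_range,
                ),
            )
        )
    return boxes
```

The resulting `xyxy` tuples could then be stacked into a tensor and handed to Kornia's bounding-box utilities; returning an empty list on malformed output keeps the parser safe to call on free-form generations.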
PR-07 — Video Input Support
Branch: feat/qwen3vl-video
Depends on: PR-03, PR-04
Goal
Extend the model and preprocessor to handle video inputs: frame sampling, video_pad token injection, image_grid_thw with temporal dimension > 1, and Text-Timestamp Alignment for temporal grounding queries.
Files to Create / Modify
kornia/models/vlm/qwen3vl/
    video_preprocessor.py    # Qwen3VLVideoPreprocessor
    model.py                 # extend Qwen3VLInput, add video_forward()
    types.py                 # add VideoGroundingOutput
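The frame-sampling step could be as simple as picking evenly spaced frame indices at a target rate and keeping their timestamps for the temporal-grounding prompts. This is a sketch under assumptions: the function names, the default `target_fps` of 2.0, and the `max_frames` cap of 64 are all hypothetical, not values taken from Qwen3-VL.

```python
from typing import List


def sample_frame_indices(
    total_frames: int,
    native_fps: float,
    target_fps: float = 2.0,  # hypothetical default
    max_frames: int = 64,     # hypothetical cap on sampled frames
) -> List[int]:
    """Return evenly spaced frame indices covering the whole clip."""
    if total_frames <= 0 or native_fps <= 0 or target_fps <= 0 or max_frames <= 0:
        raise ValueError("all arguments must be positive")
    duration_s = total_frames / native_fps
    # Desired frame count at the target rate, clamped to what is available.
    n = max(1, min(round(duration_s * target_fps), max_frames, total_frames))
    if n == 1:
        return [0]
    # Spread n indices from the first frame to the last frame inclusive.
    step = (total_frames - 1) / (n - 1)
    return [round(i * step) for i in range(n)]


def frame_timestamps(indices: List[int], native_fps: float) -> List[float]:
    """Convert frame indices to timestamps in seconds."""
    return [i / native_fps for i in indices]
```

For a 10-second clip at 30 fps, this yields 20 indices from frame 0 to frame 299; the matching timestamps are what a text-timestamp alignment scheme would interleave with the visual tokens.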
PR-08 — Tests, Benchmarks & Documentation
Branch: feat/qwen3vl-docs-tests
Depends on: PR-01 through PR-07
Goal
Deliver a complete test suite, performance benchmarks, interactive notebooks, and API documentation that brings Qwen3-VL to production-readiness within Kornia.
🤝 Contribution Intent
- I plan to submit a PR to implement this feature
This is not the final plan. It will likely be extended: more detail is still needed for the main API model class, and the plan for text is still to be added.