kornia/kornia

[Feature]: Expand test coverage for Qwen2VL vision-language model

Open

#3,556 created on February 4, 2026

 (7 comments) (0 reactions) (2 assignees) Python (8,677 stars) (892 forks) batch import
help wanted, triage

Description

🚀 Feature Description

Expand test coverage for the Qwen2VL vision-language model. It currently has only 33 lines of tests for 205 lines of implementation (a test-to-implementation ratio of 0.16), and those tests are limited to a basic smoke test.

📂 Feature Category

VLM/VLA Models (Vision Language Models/Agents) - Priority

💡 Motivation

Current situation:

  • Qwen2VL has only 33 test lines (205 lines of implementation)
  • Current tests consist of only a basic smoke test
  • Missing critical tests:
    • No gradient checks (gradcheck)
    • No component-level tests
    • No torch.compile/dynamo tests
    • No integration tests with actual vision-language tasks
    • No pretrained weight loading verification (if applicable)
    • No batch consistency tests
    • No exception handling tests
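
To make the missing gradient check concrete, here is a minimal sketch of what such a test could look like. It uses `torch.autograd.gradcheck` on a tiny stand-in module; `TinyPatchEmbed` and `check_gradients` are hypothetical names chosen for illustration and are not part of the Kornia Qwen2VL implementation. A real test would target the actual model components instead.

```python
import torch
from torch import nn
from torch.autograd import gradcheck


class TinyPatchEmbed(nn.Module):
    """Hypothetical stand-in for a vision-encoder component (not the Kornia API)."""

    def __init__(self, in_channels: int = 3, embed_dim: int = 4, patch_size: int = 2) -> None:
        super().__init__()
        self.proj = nn.Conv2d(in_channels, embed_dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        # (B, C, H, W) -> (B, num_patches, embed_dim)
        return self.proj(images).flatten(2).transpose(1, 2)


def check_gradients() -> bool:
    torch.manual_seed(0)
    # gradcheck compares analytical and numerical gradients, so it needs
    # float64 parameters and inputs to stay within tolerance.
    module = TinyPatchEmbed().double()
    images = torch.rand(1, 3, 4, 4, dtype=torch.float64, requires_grad=True)
    # Returns True on success and raises an error on a gradient mismatch.
    return gradcheck(module, (images,), eps=1e-6, atol=1e-4)
```

The input is kept deliberately small (one 4x4 image) because numerical gradient estimation perturbs every input element; the same idea would probably need reduced configurations when applied to the full model.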

Why expanded testing is needed:

  • Verify vision-language model functionality beyond basic instantiation
  • Ensure compatibility with PyTorch optimization features (torch.compile)
  • Test gradient flow through vision and language components
  • Prevent regressions in model architecture changes
  • Provide comprehensive usage examples
  • Match testing standards of other VLMs (e.g., KimiVL: 0.64 ratio)

💭 Proposed Solution

Migrate to BaseTester pattern:

  • ✅ Smoke tests: Multiple configurations, batch sizes, input formats
  • ✅ Exception tests: Invalid inputs, edge cases, error handling
  • ✅ Cardinality tests: Output shape verification for various inputs
  • ✅ Component tests: Individual module testing (vision encoder, language decoder, etc.)
  • ✅ Feature tests: Vision-language alignment, attention mechanisms
  • ✅ Gradient checks: Verify backpropagation correctness using gradcheck
  • ✅ torch.compile compatibility: Test with the torch_optimizer fixture
  • ✅ Integration tests: End-to-end vision-language tasks
  • ✅ Batch consistency: Verify batch vs. individual processing
  • ✅ Pretrained weights: Verify weight loading if pretrained models available
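
A few of the categories above (cardinality, exception, and batch consistency tests) can be sketched as follows. This is only an illustration under stated assumptions: `TinyVisionEncoder` is a hypothetical stand-in, not the Kornia Qwen2VL class, and the plain helper functions here would become methods of a BaseTester subclass (with `pytest.raises` for the exception case) in an actual PR.

```python
import torch
from torch import nn


class TinyVisionEncoder(nn.Module):
    """Hypothetical stand-in used only for illustration (not the Kornia API)."""

    def __init__(self, in_channels: int = 3, embed_dim: int = 8, patch_size: int = 4) -> None:
        super().__init__()
        self.patch_size = patch_size
        self.proj = nn.Conv2d(in_channels, embed_dim, kernel_size=patch_size, stride=patch_size)
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        if images.ndim != 4:
            raise ValueError(f"Expected a (B, C, H, W) tensor, got shape {tuple(images.shape)}")
        if images.shape[-2] % self.patch_size or images.shape[-1] % self.patch_size:
            raise ValueError("Image height and width must be divisible by the patch size")
        # (B, C, H, W) -> (B, num_patches, embed_dim)
        tokens = self.proj(images).flatten(2).transpose(1, 2)
        return self.norm(tokens)


def check_cardinality() -> None:
    model = TinyVisionEncoder().eval()
    out = model(torch.rand(2, 3, 8, 16))
    # An 8x16 image with 4x4 patches gives 2 * 4 = 8 tokens of width 8.
    assert tuple(out.shape) == (2, 8, 8)


def check_exception() -> None:
    model = TinyVisionEncoder()
    for bad_input in (torch.rand(3, 8, 8), torch.rand(1, 3, 6, 8)):
        try:
            model(bad_input)
        except ValueError:
            continue
        raise AssertionError("Expected ValueError for invalid input")


def check_batch_consistency() -> None:
    torch.manual_seed(0)
    model = TinyVisionEncoder().eval()
    images = torch.rand(3, 3, 8, 8)
    with torch.no_grad():
        batched = model(images)
        # Process each image alone, then stack the results back into a batch.
        single = torch.cat([model(image[None]) for image in images])
    torch.testing.assert_close(batched, single, atol=1e-5, rtol=1e-5)
```

The batch consistency check uses a small tolerance rather than exact equality, since batched and per-sample kernels may differ in the last few floating-point bits.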

Additional Documentation

Jupyter notebook demonstrating:

  1. Model instantiation and configuration
  2. Vision-language inference examples
  3. Image-text alignment capabilities
  4. Comparison with official Qwen2VL implementation
  5. Performance benchmarks
  6. Integration with Hugging Face models (if applicable)

🔄 Alternatives Considered

No response

🎯 Use Cases

For developers:

  • Ensure vision-language model works correctly after changes
  • Test gradient flow through model components
  • Verify compiler optimization compatibility
  • Understand model architecture through tests

For users:

  • Learn proper usage patterns for vision-language tasks
  • See practical examples with images and text
  • Understand model capabilities and limitations
  • Get copy-paste examples for inference

For maintainers:

  • Maintain code quality standards across VLMs
  • Catch regressions before deployment
  • Ensure consistency with other VLM implementations

📝 Additional Context

Gap identified: Qwen2VL has the lowest test coverage (0.16) among VLMs in the repository.
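
For reference, the ratio quoted here is simply test lines divided by implementation lines. The short sketch below reproduces it and estimates how many test lines would be needed to reach the 0.64 ratio cited for KimiVL, assuming the implementation stays at 205 lines. The function name `coverage_ratio` is hypothetical, and a line-count ratio is only a rough proxy for real coverage.

```python
import math


def coverage_ratio(test_lines: int, impl_lines: int) -> float:
    """Ratio of test lines to implementation lines (a rough proxy, not true coverage)."""
    return test_lines / impl_lines


# Figures taken from this issue.
QWEN2VL_TEST_LINES = 33
QWEN2VL_IMPL_LINES = 205
TARGET_RATIO = 0.64  # ratio cited for KimiVL

current_ratio = coverage_ratio(QWEN2VL_TEST_LINES, QWEN2VL_IMPL_LINES)
# Smallest whole number of test lines that reaches the target ratio.
lines_needed = math.ceil(TARGET_RATIO * QWEN2VL_IMPL_LINES)
additional_lines = lines_needed - QWEN2VL_TEST_LINES

print(f"current ratio: {current_ratio:.2f}")
print(f"test lines needed for {TARGET_RATIO}: {lines_needed} (+{additional_lines})")
```

Under these assumptions, matching the cited ratio would take roughly 132 test lines in total, or about 99 more than exist today.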

🤝 Contribution Intent

  • I plan to submit a PR to implement this feature
  • I'm requesting this feature but not planning to implement it
