kornia/kornia
[Feature]: Expand test coverage for Qwen2VL vision-language model
Open
#3556, created on February 4, 2026
help wanted, triage
Description
🚀 Feature Description
Expand test coverage for the Qwen2VL vision-language model. It currently has only 33 lines of tests for 205 lines of implementation (a test-to-implementation ratio of 0.16), and those tests are limited to smoke testing.
📂 Feature Category
VLM/VLA Models (Vision Language Models/Agents) - Priority
💡 Motivation
Current situation:
- `Qwen2VL` has only 33 test lines (205 lines of implementation)
- Current tests consist of only a basic smoke test
- Missing critical tests:
  - No gradient checks (gradcheck)
  - No component-level tests
  - No torch.compile/dynamo tests
  - No integration tests with actual vision-language tasks
  - No pretrained weight loading verification (if applicable)
  - No batch consistency tests
  - No exception handling tests
Why expanded testing is needed:
- Verify vision-language model functionality beyond basic instantiation
- Ensure compatibility with PyTorch optimization features (torch.compile)
- Test gradient flow through vision and language components
- Prevent regressions in model architecture changes
- Provide comprehensive usage examples
- Match testing standards of other VLMs (e.g., KimiVL: 0.64 ratio)
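For reference, the ratios quoted here can be reproduced with a few lines of arithmetic. This is a minimal sketch that assumes the ratio is simply test lines divided by implementation lines, which is consistent with 33 / 205 ≈ 0.16. The helper names are hypothetical, and the target line count assumes the implementation stays at 205 lines.

```python
import math


def line_ratio(test_lines: int, impl_lines: int) -> float:
    """Test-to-implementation line ratio (assumed definition: tests / implementation)."""
    if impl_lines <= 0:
        raise ValueError("impl_lines must be positive")
    return test_lines / impl_lines


def lines_for_ratio(target_ratio: float, impl_lines: int) -> int:
    """Smallest number of test lines that reaches target_ratio for impl_lines."""
    return math.ceil(target_ratio * impl_lines)


if __name__ == "__main__":
    # Figures taken from this issue: 33 test lines, 205 implementation lines.
    print(round(line_ratio(33, 205), 2))  # 0.16
    # Test lines needed to match the 0.64 ratio quoted for KimiVL.
    print(lines_for_ratio(0.64, 205))  # 132
```

Line counts are only a rough proxy for coverage, so the target figure illustrates the size of the gap rather than setting a goal.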
💭 Proposed Solution
Migrate to BaseTester pattern:
- ✅ Smoke tests: Multiple configurations, batch sizes, input formats
- ✅ Exception tests: Invalid inputs, edge cases, error handling
- ✅ Cardinality tests: Output shape verification for various inputs
- ✅ Component tests: Individual module testing (vision encoder, language decoder, etc.)
- ✅ Feature tests: Vision-language alignment, attention mechanisms
- ✅ Gradient checks: Verify backpropagation correctness using `gradcheck`
- ✅ Torch.compile compatibility: Test with the `torch_optimizer` fixture
- ✅ Integration tests: End-to-end vision-language tasks
- ✅ Batch consistency: Verify batch vs. individual processing
- ✅ Pretrained weights: Verify weight loading if pretrained models are available
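To make a few of the categories above concrete, here is a rough sketch of cardinality, exception, batch-consistency, and gradient checks written in plain PyTorch. The module `TinyVisionStub` and all helper functions are hypothetical stand-ins: they do not reflect the actual `Qwen2VL` API in kornia or the `BaseTester` interface, and a real test suite would call the real model and the repository's own helpers instead. The `torch.compile` check is left out because it depends on the `torch_optimizer` fixture.

```python
import torch
from torch import nn


class TinyVisionStub(nn.Module):
    """Hypothetical stand-in for a vision component; NOT the kornia Qwen2VL API."""

    def __init__(self, in_channels: int = 3, embed_dim: int = 8) -> None:
        super().__init__()
        self.proj = nn.Conv2d(in_channels, embed_dim, kernel_size=2, stride=2)
        self.act = nn.GELU()
        self.head = nn.Linear(embed_dim, embed_dim)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        if images.ndim != 4:
            raise ValueError(f"expected (B, C, H, W), got shape {tuple(images.shape)}")
        patches = self.proj(images)  # (B, D, H/2, W/2)
        tokens = patches.flatten(2).transpose(1, 2)  # (B, N, D)
        return self.head(self.act(tokens))


def check_cardinality(model: nn.Module, batch_size: int = 2) -> tuple:
    """Return the output shape for a (B, 3, 4, 4) input."""
    x = torch.rand(batch_size, 3, 4, 4, dtype=torch.float64)
    return tuple(model(x).shape)


def check_exception(model: nn.Module) -> bool:
    """A 3-D input (missing batch dimension) should raise ValueError."""
    try:
        model(torch.rand(3, 4, 4, dtype=torch.float64))
    except ValueError:
        return True
    return False


def check_batch_consistency(model: nn.Module) -> bool:
    """Batched output should match per-sample outputs concatenated along dim 0."""
    x = torch.rand(3, 3, 4, 4, dtype=torch.float64)
    with torch.no_grad():
        batched = model(x)
        single = torch.cat([model(x[i : i + 1]) for i in range(x.shape[0])])
    return torch.allclose(batched, single)


def check_gradients(model: nn.Module) -> bool:
    """Compare analytical and numerical gradients; gradcheck needs float64 inputs."""
    x = torch.rand(1, 3, 4, 4, dtype=torch.float64, requires_grad=True)
    return torch.autograd.gradcheck(model, (x,))


if __name__ == "__main__":
    torch.manual_seed(0)
    stub = TinyVisionStub().double().eval()
    print(check_cardinality(stub))
    print(check_exception(stub), check_batch_consistency(stub), check_gradients(stub))
```

The inputs are kept tiny and in double precision so that `torch.autograd.gradcheck` stays fast and numerically stable. The same idea would apply to the real model by checking small sub-modules rather than the full network.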
Additional Documentation
Jupyter notebook demonstrating:
- Model instantiation and configuration
- Vision-language inference examples
- Image-text alignment capabilities
- Comparison with official Qwen2VL implementation
- Performance benchmarks
- Integration with Hugging Face models (if applicable)
🔄 Alternatives Considered
No response
🎯 Use Cases
For developers:
- Ensure vision-language model works correctly after changes
- Test gradient flow through model components
- Verify compiler optimization compatibility
- Understand model architecture through tests
For users:
- Learn proper usage patterns for vision-language tasks
- See practical examples with images and text
- Understand model capabilities and limitations
- Get copy-paste examples for inference
For maintainers:
- Maintain code quality standards across VLMs
- Catch regressions before deployment
- Ensure consistency with other VLM implementations
📝 Additional Context
Gap identified: Qwen2VL has the lowest test-to-implementation line ratio (0.16) among VLMs in the repository.
🤝 Contribution Intent
- I plan to submit a PR to implement this feature
- I'm requesting this feature but not planning to implement it