[Feature]: Reduce redundant tensor layout transformations in Image API
#3696 opened on April 5, 2026
🚀 Feature Description
Optimize the kornia.image.Image API by minimizing redundant tensor layout transformations (permute) and enabling more efficient image processing pipelines through canonical layout handling and optional in-place operations.
📂 Feature Category
Image Processing
💡 Motivation
Currently, multiple methods in the Image class (e.g., to_gray, to_rgb, to_bgr) internally perform repeated layout conversions when handling CHANNELS_LAST inputs.
Typical pattern:
- Convert to CHANNELS_FIRST via permute
- Apply operation
- Convert back via permute
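In plain PyTorch, this pattern looks roughly like the sketch below. The helper names (to_rgb_hwc, to_gray_hwc) and the PERMUTE_COUNT counter are hypothetical illustrations, not Kornia's internals. The counter only shows that a two-step chain on a CHANNELS_LAST image performs four layout conversions.

```python
import torch

PERMUTE_COUNT = 0  # hypothetical instrumentation: number of layout permutes performed


def _hwc_to_chw(x: torch.Tensor) -> torch.Tensor:
    global PERMUTE_COUNT
    PERMUTE_COUNT += 1
    return x.permute(2, 0, 1)


def _chw_to_hwc(x: torch.Tensor) -> torch.Tensor:
    global PERMUTE_COUNT
    PERMUTE_COUNT += 1
    return x.permute(1, 2, 0)


def to_rgb_hwc(bgr: torch.Tensor) -> torch.Tensor:
    chw = _hwc_to_chw(bgr)   # 1. convert to CHANNELS_FIRST
    rgb = chw.flip(0)        # 2. apply the operation (reverse channel order)
    return _chw_to_hwc(rgb)  # 3. convert back to CHANNELS_LAST


def to_gray_hwc(rgb: torch.Tensor) -> torch.Tensor:
    chw = _hwc_to_chw(rgb)
    weights = torch.tensor([0.299, 0.587, 0.114], dtype=rgb.dtype, device=rgb.device)
    gray = (chw * weights.view(3, 1, 1)).sum(dim=0, keepdim=True)
    return _chw_to_hwc(gray)


bgr_image = torch.rand(4, 5, 3)  # H, W, C
gray_image = to_gray_hwc(to_rgb_hwc(bgr_image))
print(gray_image.shape, PERMUTE_COUNT)  # torch.Size([4, 5, 1]) 4
```

The permute at the end of to_rgb_hwc and the one at the start of to_gray_hwc undo each other, which is the redundancy this issue targets.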
This results in:
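- Repeated stride changes: permute itself only returns a strided view, but any subsequent step that requires contiguous memory (e.g. an explicit .contiguous() call) physically reorders the data
- Increased overhead in chained pipelines
- Avoidable data movement, especially for large images or batched inputs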
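The first point can be checked directly in PyTorch. permute only rewrites shape and stride metadata, but making the result contiguous copies every pixel into a new buffer. A minimal illustration (the 480x640 size is arbitrary):

```python
import torch

img_hwc = torch.rand(480, 640, 3)    # CHANNELS_LAST, contiguous
chw_view = img_hwc.permute(2, 0, 1)  # CHANNELS_FIRST view of the same memory

# No data has moved yet: only the strides changed.
print(chw_view.data_ptr() == img_hwc.data_ptr())  # True
print(img_hwc.stride(), chw_view.stride())        # (1920, 3, 1) (1, 1920, 3)
print(chw_view.is_contiguous())                   # False

# A contiguous copy physically reorders the data into a new allocation.
chw_dense = chw_view.contiguous()
print(chw_dense.data_ptr() == img_hwc.data_ptr())  # False
print(chw_dense.stride())                          # (307200, 640, 1)
```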
While Kornia already supports zero-copy interop via DLPack, these internal transformations introduce a separate performance bottleneck at the execution level.
💭 Proposed Solution
Option 1: Canonical Internal Layout
Standardize internal representation to CHANNELS_FIRST:
- Convert once at construction or entry point
- Perform all operations in canonical layout
- Convert back only when explicitly requested
Option 2: Lazy Layout Handling
Track layout via metadata without immediate permute:
- Defer physical layout transformation until required
- Avoid unnecessary conversions in chained operations
Option 3: In-place Variants
Introduce in-place APIs to reduce allocations:
- img.to_gray_()
- img.to_rgb_()
Option 4: Operation Fusion (optional future work)
Enable direct conversions (e.g., BGR → GRAY) without intermediate representations.
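A minimal sketch of such a fusion follows. It assumes BT.601 luma weights and (..., 3, H, W) tensors, and the function names are hypothetical. Instead of flipping the pixels from BGR to RGB, the fused version flips the three weights, so no intermediate RGB image is allocated.

The last function also writes the result into the input's first channel, in the spirit of Option 3. This has trade-offs:

- It overwrites the caller's tensor.
- Autograd rejects it on a leaf tensor that requires grad.
- It cannot shrink the allocation, because it returns a 1-channel view of the original 3-channel storage.

```python
import torch

RGB_WEIGHTS = (0.299, 0.587, 0.114)  # BT.601 luma weights in R, G, B order


def bgr_to_gray_two_step(bgr: torch.Tensor) -> torch.Tensor:
    # Reference path: materialize a full RGB intermediate, then reduce.
    rgb = bgr.flip(-3)
    w = torch.tensor(RGB_WEIGHTS, dtype=bgr.dtype, device=bgr.device).view(3, 1, 1)
    return (rgb * w).sum(dim=-3, keepdim=True)


def bgr_to_gray_fused(bgr: torch.Tensor) -> torch.Tensor:
    # Reverse the weights instead of the pixels: one contraction over the channel axis.
    w = torch.tensor(RGB_WEIGHTS[::-1], dtype=bgr.dtype, device=bgr.device)
    return torch.einsum("...chw,c->...hw", bgr, w).unsqueeze(-3)


def bgr_to_gray_fused_(bgr: torch.Tensor) -> torch.Tensor:
    # Hypothetical in-place variant: accumulate the weighted sum into channel 0
    # (the B plane) and return a 1-channel view that shares the input's storage.
    b, g, r = bgr[..., 0:1, :, :], bgr[..., 1:2, :, :], bgr[..., 2:3, :, :]
    b.mul_(RGB_WEIGHTS[2]).add_(g, alpha=RGB_WEIGHTS[1]).add_(r, alpha=RGB_WEIGHTS[0])
    return b


x = torch.rand(2, 3, 4, 5)
reference = bgr_to_gray_two_step(x)
print(torch.allclose(bgr_to_gray_fused(x), reference, atol=1e-6))  # True
y = x.clone()
out = bgr_to_gray_fused_(y)
print(out.shape, out.data_ptr() == y.data_ptr())  # torch.Size([2, 1, 4, 5]) True
```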
🔄 Alternatives Considered
- Keeping the current explicit layout handling (simpler but less efficient)
- Relying on users to manually normalize layout before using the API (error-prone and not ergonomic)
🎯 Use Cases
- High-throughput image pipelines (e.g., preprocessing for deep learning)
- Robotics / real-time vision systems
- Batched image transformations on GPU
- Scenarios where minimizing memory movement is critical
📝 Additional Context
This issue is conceptually similar to how DLPack eliminates unnecessary memory copies across frameworks. Here, the goal is to reduce intra-framework data movement by optimizing layout handling and execution flow.
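The metadata-based approach (Option 2) works as follows. If the image records which axis holds its channels, channel-wise operations can index that axis directly and never permute. A physical transformation then happens at most once, when a caller requests a specific layout. LazyLayoutImage below is a hypothetical sketch, not part of Kornia. For 4-D batched tensors, PyTorch's own torch.channels_last memory format already applies the same idea, as the last lines of the block show.

```python
import torch


class LazyLayoutImage:
    """Hypothetical wrapper: keeps pixels in their incoming layout and records where channels live."""

    def __init__(self, data: torch.Tensor, channels_last: bool) -> None:
        self.data = data
        self.channels_last = channels_last

    @property
    def channel_dim(self) -> int:
        # Channels are the last axis for HWC, and axis -3 for CHW (with or without a batch axis).
        return -1 if self.channels_last else -3

    def bgr_to_rgb(self) -> "LazyLayoutImage":
        # Reversing the channel axis works in either layout, so no permute is needed.
        return LazyLayoutImage(self.data.flip(self.channel_dim), self.channels_last)

    def rgb_to_gray(self) -> "LazyLayoutImage":
        r, g, b = self.data.unbind(self.channel_dim)
        gray = 0.299 * r + 0.587 * g + 0.114 * b
        return LazyLayoutImage(gray.unsqueeze(self.channel_dim), self.channels_last)

    def materialize(self, channels_last: bool) -> torch.Tensor:
        # The only layout transformation, deferred until a caller asks for a specific layout.
        if channels_last == self.channels_last:
            return self.data
        return self.data.movedim(self.channel_dim, -1 if channels_last else -3)


hwc = torch.rand(4, 5, 3)
gray_chw = LazyLayoutImage(hwc, channels_last=True).bgr_to_rgb().rgb_to_gray().materialize(channels_last=False)
print(gray_chw.shape)  # torch.Size([1, 4, 5])

# PyTorch's channels_last memory format applies the same idea to 4-D tensors:
# a permuted NHWC tensor is already a valid NCHW tensor, with no copy.
nhwc = torch.rand(2, 8, 8, 3)
nchw = nhwc.permute(0, 3, 1, 2)
print(nchw.is_contiguous(memory_format=torch.channels_last))  # True
print(nchw.contiguous(memory_format=torch.channels_last).data_ptr() == nhwc.data_ptr())  # True
```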
Initial profiling suggests that redundant permute operations can contribute significantly to runtime in chained transformations.
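A rough micro-benchmark along these lines can be sketched in plain PyTorch. It compares a hypothetical three-op chain in which every op does its own round trip against the convert-once approach of Option 1. The ops, image size and repeat count are illustrative assumptions. Each conversion calls .contiguous() to model kernels that require dense input. No timings are claimed here, since results depend on hardware and device.

```python
import time

import torch

GRAY_WEIGHTS = torch.tensor([0.299, 0.587, 0.114]).view(3, 1, 1)


# Hypothetical CHW operations standing in for a real pipeline.
def bgr_to_rgb(chw: torch.Tensor) -> torch.Tensor:
    return chw.flip(0)


def normalize(chw: torch.Tensor) -> torch.Tensor:
    return (chw - 0.5) / 0.25


def rgb_to_gray(chw: torch.Tensor) -> torch.Tensor:
    return (chw * GRAY_WEIGHTS.to(chw)).sum(dim=0, keepdim=True)


OPS = (bgr_to_rgb, normalize, rgb_to_gray)


def run_round_trip(x_hwc: torch.Tensor) -> torch.Tensor:
    # Current pattern: every op converts HWC -> CHW and back again.
    for op in OPS:
        x_hwc = op(x_hwc.permute(2, 0, 1).contiguous()).permute(1, 2, 0).contiguous()
    return x_hwc


def run_canonical(x_hwc: torch.Tensor) -> torch.Tensor:
    # Option 1: convert once on entry and once on exit.
    chw = x_hwc.permute(2, 0, 1).contiguous()
    for op in OPS:
        chw = op(chw)
    return chw.permute(1, 2, 0).contiguous()


def mean_seconds(fn, x: torch.Tensor, repeats: int = 10) -> float:
    fn(x)  # warm-up
    if x.is_cuda:
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(repeats):
        fn(x)
    if x.is_cuda:
        torch.cuda.synchronize()  # GPU kernels run asynchronously
    return (time.perf_counter() - start) / repeats


img = torch.rand(1080, 1920, 3)  # move to a CUDA device to profile on GPU
assert torch.allclose(run_round_trip(img), run_canonical(img))
print(f"round trip: {mean_seconds(run_round_trip, img) * 1e3:.2f} ms/iter")
print(f"canonical:  {mean_seconds(run_canonical, img) * 1e3:.2f} ms/iter")
```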
🤝 Contribution Intent
- I plan to submit a PR to implement this feature
- I'm requesting this feature but not planning to implement it