[Feature]: Reduce redundant tensor layout transformations in Image API · kornia/kornia#3696

2 comments · 0 reactions · 0 assignees · Python · 8,677 stars · 892 forks

Labels: help wanted, triage

Description

🚀 Feature Description

Optimize the kornia.image.Image API by minimizing redundant tensor layout transformations (permute) and enabling more efficient image processing pipelines through canonical layout handling and optional in-place operations.

📂 Feature Category

Image Processing

💡 Motivation

Currently, multiple methods in the Image class (e.g., to_gray, to_rgb, to_bgr) internally perform repeated layout conversions when handling CHANNELS_LAST inputs.

Typical pattern:

1. Convert to CHANNELS_FIRST via permute
2. Apply the operation
3. Convert back via permute

This results in:

- Repeated tensor stride changes and potential memory reordering
- Increased overhead in chained pipelines
- Avoidable data movement, especially for large images or batched inputs
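To make the round-trip pattern above concrete, here is a minimal sketch in plain PyTorch. It is not Kornia's actual implementation: `gray_channels_first` and `to_gray_round_trip` are hypothetical stand-ins, and the weights are the common BT.601 luma coefficients, assumed here for illustration only.

```python
import torch

# BT.601 luma weights for R, G, B (assumed for illustration).
RGB_TO_GRAY = torch.tensor([0.299, 0.587, 0.114])


def gray_channels_first(img_chw: torch.Tensor) -> torch.Tensor:
    """Grayscale conversion that expects a float (3, H, W) tensor."""
    weights = RGB_TO_GRAY.to(img_chw.dtype).view(3, 1, 1)
    return (img_chw * weights).sum(dim=0, keepdim=True)  # (1, H, W)


def to_gray_round_trip(img_hwc: torch.Tensor) -> torch.Tensor:
    """Hypothetical stand-in for the pattern: permute, operate, permute back."""
    chw = img_hwc.permute(2, 0, 1)   # (H, W, C) -> (C, H, W), a strided view
    gray = gray_channels_first(chw)  # (1, H, W)
    return gray.permute(1, 2, 0)     # back to (H, W, 1)


img = torch.rand(4, 6, 3)            # small channels-last RGB image
chw_view = img.permute(2, 0, 1)

print(img.stride(), chw_view.stride())  # (18, 3, 1) (1, 18, 3)
print(chw_view.is_contiguous())         # False
out = to_gray_round_trip(img)
print(out.shape)                        # torch.Size([4, 6, 1])
```

`permute` itself only rewrites strides. Any cost comes from later kernels reading the non-contiguous view, or from an explicit `.contiguous()` call that materializes the reordered data.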

While Kornia already supports zero-copy interop via DLPack, these internal transformations introduce a separate performance bottleneck at the execution level.

💭 Proposed Solution

Option 1: Canonical Internal Layout

Standardize internal representation to CHANNELS_FIRST:

- Convert once at construction or entry point
- Perform all operations in canonical layout
- Convert back only when explicitly requested

Option 2: Lazy Layout Handling

Track layout via metadata without immediate permute:

- Defer physical layout transformation until required
- Avoid unnecessary conversions in chained operations

Option 3: In-place Variants

Introduce in-place APIs to reduce allocations:

- img.to_gray_()
- img.to_rgb_()

Option 4: Operation Fusion (optional future work)

Enable direct conversions (e.g., BGR → GRAY) without intermediate representations.
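As a rough illustration of how Options 2 and 4 might combine, the sketch below tracks layout as metadata and converts BGR to GRAY in a single weighted sum along whichever axis holds the channels. `LazyImage`, its methods, and the layout constants are hypothetical names invented for this example and are not part of the Kornia API. It assumes floating-point, unbatched 3-channel input.

```python
from dataclasses import dataclass

import torch

CHANNELS_FIRST = "channels_first"  # (C, H, W)
CHANNELS_LAST = "channels_last"    # (H, W, C)

# BT.601 luma weights in B, G, R order (assumed for illustration).
BGR_TO_GRAY = (0.114, 0.587, 0.299)


@dataclass
class LazyImage:
    """Hypothetical wrapper: layout is metadata, data is never permuted eagerly."""

    data: torch.Tensor
    layout: str

    @property
    def channel_dim(self) -> int:
        return 0 if self.layout == CHANNELS_FIRST else -1

    def bgr_to_gray(self) -> "LazyImage":
        """Fused BGR -> GRAY with no RGB intermediate and no permute."""
        dim = self.channel_dim
        shape = [1] * self.data.dim()
        shape[dim] = 3  # broadcast the weights along the channel axis
        w = torch.tensor(
            BGR_TO_GRAY, dtype=self.data.dtype, device=self.data.device
        ).view(*shape)
        gray = (self.data * w).sum(dim=dim, keepdim=True)
        return LazyImage(gray, self.layout)

    def as_layout(self, layout: str) -> torch.Tensor:
        """The only place a physical layout change is requested."""
        if layout == self.layout:
            return self.data
        if layout == CHANNELS_FIRST:
            return self.data.permute(2, 0, 1)  # (H, W, C) -> (C, H, W)
        return self.data.permute(1, 2, 0)      # (C, H, W) -> (H, W, C)


bgr = LazyImage(torch.rand(4, 6, 3), CHANNELS_LAST)
gray = bgr.bgr_to_gray()
print(gray.data.shape)                       # torch.Size([4, 6, 1])
print(gray.as_layout(CHANNELS_FIRST).shape)  # torch.Size([1, 4, 6])
```

In this sketch the chained operation stays in the caller's layout throughout, and `permute` is reached only through an explicit `as_layout` call.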

🔄 Alternatives Considered

- Keeping current explicit layout handling (simpler but less efficient)
- Relying on users to manually normalize layout before using the API (error-prone and not ergonomic)

🎯 Use Cases

- High-throughput image pipelines (e.g., preprocessing for deep learning)
- Robotics / real-time vision systems
- Batched image transformations on GPU
- Scenarios where minimizing memory movement is critical

📝 Additional Context

This issue is conceptually similar to how DLPack eliminates unnecessary memory copies across frameworks. Here, the goal is to reduce intra-framework data movement by optimizing layout handling and execution flow.

Initial profiling suggests that redundant permute operations can contribute significantly to runtime in chained transformations.
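A small timing harness along these lines could help quantify the claim. It is only a sketch: the function names are hypothetical, and the `.contiguous()` calls are an assumption that models the case where the reordered data is actually materialized. No particular outcome is implied, and numbers will vary with hardware, image size, and dtype.

```python
import time

import torch

# BT.601 luma weights for R, G, B (assumed for illustration).
WEIGHTS = torch.tensor([0.299, 0.587, 0.114])


def gray_round_trip(img_hwc: torch.Tensor) -> torch.Tensor:
    """Permute to channels-first, convert, permute back (with forced copies)."""
    chw = img_hwc.permute(2, 0, 1).contiguous()  # physical reorder
    gray = (chw * WEIGHTS.view(3, 1, 1)).sum(dim=0, keepdim=True)
    return gray.permute(1, 2, 0).contiguous()


def gray_direct(img_hwc: torch.Tensor) -> torch.Tensor:
    """Convert directly in channels-last layout, with no permute."""
    return (img_hwc * WEIGHTS).sum(dim=-1, keepdim=True)


def time_fn(fn, img: torch.Tensor, repeats: int = 20) -> float:
    """Average wall-clock seconds per call on CPU, after one warm-up call."""
    fn(img)
    start = time.perf_counter()
    for _ in range(repeats):
        fn(img)
    return (time.perf_counter() - start) / repeats


img = torch.rand(512, 512, 3)
t_round = time_fn(gray_round_trip, img)
t_direct = time_fn(gray_direct, img)
print(f"round trip: {t_round * 1e3:.3f} ms, direct: {t_direct * 1e3:.3f} ms")
```

For GPU tensors, `torch.cuda.synchronize()` would be needed before reading the clock, since CUDA kernels launch asynchronously.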

🤝 Contribution Intent

I plan to submit a PR to implement this feature
I'm requesting this feature but not planning to implement it
