[RFC] Batteries Included - Phase 3 · pytorch/vision#6323

Repository metrics

Stars: 15,050
PR merge metrics: average time to merge 12 days 8 hours; 14 PRs merged in the last 30 days

Description

🚀 The feature

Note: To track the progress of the project check out this board.

This is the 3rd phase of TorchVision's modernization project (see phase 1 and 2). We aim to keep TorchVision relevant by ensuring it provides, off the shelf, all the primitives, model architectures and recipe utilities needed to produce SOTA results for the supported Computer Vision tasks.

To enable our users to reproduce the latest state-of-the-art research we will enhance TorchVision with the following data augmentations, layers, losses and other operators:

Data Augmentations

AutoAugment for Detection [1, 2] - #6224 #6609
Mosaic [1, 2] - #6534
Mixup for Detection [1, 2] - #6720 #6721

Losses

Dice Loss [1, 2] - #6435 #6960
Poly Loss [1, 2] - #6439 #6457
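
For readers unfamiliar with the two losses listed above, here is a minimal plain-Python sketch of the commonly used formulations: soft Dice loss for a binary mask, and the Poly-1 variant of Poly Loss (cross-entropy plus a single extra polynomial term). The function names, argument layout and the `eps` / `epsilon` defaults are illustrative assumptions and are not taken from the linked PRs; the eventual TorchVision implementations may differ.

```python
import math


def dice_loss(probs, targets, eps=1e-6):
    """Soft Dice loss over flat sequences (hypothetical helper, not the TorchVision API).

    probs:   predicted foreground probabilities in [0, 1]
    targets: ground-truth labels, 0 or 1
    eps:     smoothing term that avoids division by zero on empty masks
    """
    intersection = sum(p * t for p, t in zip(probs, targets))
    total = sum(probs) + sum(targets)
    return 1.0 - (2.0 * intersection + eps) / (total + eps)


def poly1_loss(class_probs, target_index, epsilon=1.0):
    """Poly-1 loss for one sample: cross-entropy + epsilon * (1 - p_t).

    class_probs:  predicted class probabilities (already softmax-normalized)
    target_index: index of the true class
    epsilon:      weight of the first polynomial term (assumed default)
    """
    p_t = class_probs[target_index]
    cross_entropy = -math.log(p_t)
    return cross_entropy + epsilon * (1.0 - p_t)
```

With `epsilon=0.0`, `poly1_loss` reduces to plain cross-entropy, which makes it easy to compare against an existing baseline.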

Operators added in PyTorch Core

LARS Optimizer [1, 2] - https://github.com/pytorch/pytorch/pull/88106
LAMB Optimizer [1, 2] - #6868
Polynomial LR Scheduler [1, 2] - code - https://github.com/pytorch/pytorch/pull/82769
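
As a rough illustration of what a polynomial learning-rate schedule computes, the sketch below decays a base learning rate to zero over a fixed number of steps. The function name `polynomial_lr` and its parameters are hypothetical and chosen for this example only; the scheduler added in PyTorch core is a class with its own interface, so treat this as the underlying formula rather than the actual API.

```python
def polynomial_lr(base_lr, step, total_steps, power=1.0):
    """Polynomially decayed learning rate (illustrative helper, not the PyTorch API).

    lr = base_lr * (1 - min(step, total_steps) / total_steps) ** power
    """
    # Clamp so the rate stays at zero once the schedule has finished.
    progress = min(step, total_steps) / total_steps
    return base_lr * (1.0 - progress) ** power


# Example: a linear decay (power=1.0) from 0.1 to 0 over 10 steps.
schedule = [polynomial_lr(0.1, s, total_steps=10) for s in range(11)]
```

A `power` of 1.0 gives linear decay; larger values drop the rate faster early in training.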

To ensure that our users have access to the most popular SOTA models, we will add the following architectures along with pre-trained weights:

Image Classification

Swin Transformer V2 - #6242 #6246
MobileViT v1 & v2 [1, 2] - #6404
MaxViT - #6342

Video Classification

MViTv2 [1] - #6373
Swin3d [1] - #6499 #6521
S3D [1] - #6402 #6412 #6537

To ensure that our users have access to strong baselines and SOTA weights, we will improve our training recipes to incorporate the newly released primitives and offer improved pre-trained models:

Reference Scripts

Update the Reference Scripts to use the latest primitives - #6405 #6433

Pre-trained weights

Improve the accuracy of Video models

Other Candidates

There are several other Operators (#5414), Losses (#2980), Augmentations (#3817) and Models (#2707) proposed by the community. Here are some potential candidates that we could implement depending on bandwidth. Contributions are welcome for any of the below:

YOLOX [1] - #6341
DeTR - #5922 #6922
U-Net - #6610 #6611
MViTv2 for Images [1]
Video Transformer Network [1]
MTV
Deformable DeTR
Shortcut Regularizer (FX-based)
Hide-and-Seek - #6796

cc @datumbox @vfdev-5

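Contributor guide

Research direction: This is a meta-issue tracking the third phase of TorchVision's modernization. It lists many subtasks, such as new primitives (AutoAugment, Mosaic, Dice Loss, etc.), new model architectures (MobileViT, etc.) and improved training recipes. Each subtask has linked issues and PRs. To contribute, newcomers should first pick a specific subtask from the checklist, read the linked issues and understand its scope. This issue does not itself contain complete information; the specific sub-items need to be investigated.
Tech stack: Python, PyTorch
Domain: machine learning, AI
Issue type: Investigation
Difficulty: 4
Estimated time: more than 1 week
Activity status: Active
Clarity: Mostly clear
Prerequisites: Python, PyTorch, Git, Computer Vision
Beginner friendliness: 15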