[RFC] Batteries Included - Phase 3 · pytorch/vision#6323

Repository metrics

Stars: 15,050
PR merge metrics: average merge time 12d 8h; 14 PRs merged in the last 30 days

Description

🚀 The feature

Note: To track the progress of the project check out this board.

This is the 3rd phase of TorchVision's modernization project (see phases 1 and 2). We aim to keep TorchVision relevant by ensuring it provides, off the shelf, all the primitives, model architectures and recipe utilities needed to produce SOTA results for the supported Computer Vision tasks.

To enable our users to reproduce the latest state-of-the-art research we will enhance TorchVision with the following data augmentations, layers, losses and other operators:

Data Augmentations

AutoAugment for Detection [1, 2] - #6224 #6609
Mosaic [1, 2] - #6534
Mixup for Detection [1, 2] - #6720 #6721
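
For contributors picking up the Mixup item, the core idea can be sketched without any dependencies: blend two images pixel by pixel with a weight drawn from a Beta distribution and keep the boxes of both. This is only an illustration, not the implementation in the linked PRs. The function name `mixup_detection`, the nested-list image format, the `alpha=1.5` default and the optional `lam` override are all assumptions made for the example, and both images are assumed to have the same size.

```python
import random


def mixup_detection(image_a, boxes_a, image_b, boxes_b, alpha=1.5, lam=None):
    """Blend two same-sized images and keep the boxes of both (hypothetical helper).

    Images are nested lists of floats (rows of pixel values); boxes are any
    sequence of box tuples. If `lam` is None it is sampled from Beta(alpha, alpha).
    """
    if lam is None:
        lam = random.betavariate(alpha, alpha)
    # Per-pixel convex combination of the two images.
    mixed = [
        [lam * pa + (1.0 - lam) * pb for pa, pb in zip(row_a, row_b)]
        for row_a, row_b in zip(image_a, image_b)
    ]
    # Detection targets are not blended: the mixed image carries both box sets.
    return mixed, list(boxes_a) + list(boxes_b), lam
```

A real transform would also have to handle images of different sizes (for example by padding) and carry labels along with the boxes.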

Losses

Dice Loss [1, 2] - #6435 #6960
Poly Loss [1, 2] - #6439 #6457
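
As a rough orientation for the two losses above, here is a dependency-free sketch of their usual formulations: soft Dice loss as one minus the Dice coefficient, and the Poly-1 variant of Poly Loss as cross-entropy plus a weighted `(1 - p_t)` term. The function names, the flat-list inputs and the default `eps`/`epsilon` values are assumptions for illustration; the linked PRs may use different signatures, reductions and defaults.

```python
import math


def dice_loss(probs, targets, eps=1e-6):
    """Soft Dice loss over flat sequences of probabilities and 0/1 targets."""
    intersection = sum(p * t for p, t in zip(probs, targets))
    denominator = sum(probs) + sum(targets)
    # eps keeps the ratio defined when both prediction and target are empty.
    return 1.0 - (2.0 * intersection + eps) / (denominator + eps)


def poly1_cross_entropy(class_probs, target_index, epsilon=1.0):
    """Poly-1 loss for one sample: cross-entropy plus epsilon * (1 - p_t)."""
    p_t = class_probs[target_index]  # probability assigned to the true class
    return -math.log(p_t) + epsilon * (1.0 - p_t)
```

With `epsilon=0.0` the second function reduces to plain cross-entropy, which makes it easy to compare against an existing baseline.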

Operators added in PyTorch Core

LARS Optimizer [1, 2] - https://github.com/pytorch/pytorch/pull/88106
LAMB Optimizer [1, 2] - #6868
Polynomial LR Scheduler [1, 2] - code - https://github.com/pytorch/pytorch/pull/82769
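
For reference, polynomial decay is commonly written in closed form as `base_lr * (1 - t / total_iters) ** power`, with `t` clamped to `total_iters`. The helper below is a plain-Python sketch of that formula only; the name `polynomial_lr` and its arguments are hypothetical and it is not the scheduler class from the linked PyTorch PR.

```python
def polynomial_lr(base_lr, step, total_iters, power=1.0):
    """Closed-form polynomial decay from base_lr down to 0 (hypothetical helper)."""
    if total_iters <= 0:
        raise ValueError("total_iters must be positive")
    # Clamp so the rate stays at 0 once the schedule has finished.
    t = min(max(step, 0), total_iters)
    return base_lr * (1.0 - t / total_iters) ** power
```

With `power=1.0` this is a linear ramp to zero; larger powers decay faster early in training.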

To ensure that our users have access to the most popular SOTA models, we will add the following architectures along with pre-trained weights:

Image Classification

Swin Transformer V2 - #6242 #6246
MobileViT v1 & v2 [1, 2] - #6404
MaxViT - #6342

Video Classification

MViTv2 [1] - #6373
Swin3d [1] - #6499 #6521
S3D [1] - #6402 #6412 #6537

To ensure that our users have access to strong baselines and SOTA weights, we will improve our training recipes to incorporate the newly released primitives and offer improved pre-trained models:

Reference Scripts

Update the Reference Scripts to use the latest primitives - #6405 #6433

Pre-trained weights

Improve the accuracy of Video models

Other Candidates

There are several other Operators (#5414), Losses (#2980), Augmentations (#3817) and Models (#2707) proposed by the community. Here are some potential candidates that we could implement depending on bandwidth. Contributions are welcome for any of the below:

YOLOX [1] - #6341
DeTR - #5922 #6922
U-Net - #6610 #6611
MViTv2 for Images [1]
Video Transformer Network [1]
MTV
Deformable DeTR
Shortcut Regularizer (FX-based)
Hide-and-Seek - #6796

cc @datumbox @vfdev-5

Contributor guide

Research direction: Select one subtask from the checklist (e.g., Mosaic augmentation). Read the linked references and existing implementations, work out how it would fit into TorchVision's existing codebase, and propose a PR.
Tech stack: Python, PyTorch
Domain: machine learning, AI
Issue type: Research
Difficulty: 4
Estimated time: Over 1 week
Activity status: Active
Clarity: Mostly clear
Prerequisites: Python, PyTorch, Git, Computer Vision
Newbie friendliness: 15