Save cfg.TRAIN.START_EPOCH to state dict · STVIR/pysot#397

5 comments · 0 reactions · 0 assignees · Python · 4,314 stars · 1,153 forks

bug · good first issue · has workaround · small

Description

Well, #77 didn't work for me when resuming from checkpoint_18.pth. The problem is that when we resume, the model and optimizer passed to the restore_from function are only suitable for epochs below 10 (while the backbone is not yet training), because cfg.TRAIN.START_EPOCH is initially 0 when it is passed to build_opt_lr just before restore_from. This no longer matches the optimizer that was saved after the backbone started training. So to resume my training, I set cfg.TRAIN.START_EPOCH to 19. When build_opt_lr receives an epoch greater than 10 (i.e. backbone training has started), it produces a model and optimizer suitable for resuming, and I can resume my training.

Originally posted by @PhenomenalOnee in https://github.com/STVIR/pysot/issues/92#issuecomment-651571350
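
The mismatch described in the quoted comment can be illustrated with a small sketch. This is not pysot's code: `trainable_parts` and `resume_is_consistent` are hypothetical stand-ins for what build_opt_lr decides, and the threshold of 10 and the `>=` boundary are assumptions taken from the wording of the comment.

```python
# Hypothetical, simplified stand-in for build_opt_lr: it only reports which
# parts of the model would be handed to the optimizer at a given epoch.
BACKBONE_TRAIN_EPOCH = 10  # assumed from the comment ("epoch less than 10")


def trainable_parts(current_epoch, backbone_train_epoch=BACKBONE_TRAIN_EPOCH):
    """Return the names of the model parts that are trained at this epoch."""
    parts = ["head"]
    if current_epoch >= backbone_train_epoch:  # assumed boundary condition
        parts.insert(0, "backbone")
    return parts


def resume_is_consistent(start_epoch, checkpoint_epoch):
    """True if an optimizer built for start_epoch covers the same parts as
    the optimizer that was saved at checkpoint_epoch."""
    return trainable_parts(start_epoch) == trainable_parts(checkpoint_epoch)


print(trainable_parts(0))             # optimizer built with START_EPOCH = 0
print(trainable_parts(19))            # optimizer built with START_EPOCH = 19
print(resume_is_consistent(0, 18))    # default config vs. checkpoint_18
print(resume_is_consistent(19, 18))   # workaround from the comment
```

With the default start epoch the freshly built optimizer covers fewer parts than the saved one, which is the situation in which loading the saved optimizer state fails. Setting the start epoch to 19 makes the two agree.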

Contributor guide

Tech stack: python, pytorch
Domain: machine learning, ai
Issue type: bug
Difficulty: 3
Estimated time: half day
Activity status: stale
Clarity: mostly clear
Prerequisites: basic understanding of PyTorch training, familiarity with checkpointing
Newbie friendliness: 40
Research direction: Investigate how cfg.TRAIN.START_EPOCH is used in the training scripts and in optimizer initialization. The fix involves saving this config value in the checkpoint state dict and restoring it on resume. Look at files like train.py or core/train.py and at the restore_from function. See issue #92 for additional context.
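
One possible shape for such a fix is sketched below, under stated assumptions. The helpers `save_checkpoint` and `peek_start_epoch` and the checkpoint keys are hypothetical, not pysot's actual API. `pickle` is used only to keep the sketch self-contained, and in a real training script `torch.save` and `torch.load` would take its place. The idea is to read the saved epoch first, so that the optimizer can be built for the correct epoch before its state is restored.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, epoch, model_state, optimizer_state):
    """Store the epoch next to the model and optimizer state (assumed layout)."""
    with open(path, "wb") as f:
        pickle.dump(
            {"epoch": epoch, "state_dict": model_state, "optimizer": optimizer_state},
            f,
        )


def peek_start_epoch(path, default=0):
    """Return the epoch to resume from: one past the saved epoch, or the
    default if the checkpoint does not record an epoch."""
    with open(path, "rb") as f:
        ckpt = pickle.load(f)
    return ckpt["epoch"] + 1 if "epoch" in ckpt else default


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "checkpoint_18.pth")
    save_checkpoint(path, 18, model_state={}, optimizer_state={})
    # Read the epoch before building the optimizer, then pass it on in place
    # of the default cfg.TRAIN.START_EPOCH of 0.
    start_epoch = peek_start_epoch(path)

print(start_epoch)  # 19
```

The resulting `start_epoch` would then be passed to build_opt_lr ahead of restore_from, which automates the manual workaround of setting cfg.TRAIN.START_EPOCH to 19.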