Lightning-AI/pytorch-lightning

Enable loading `universal checkpointing` checkpoints in `DeepSpeedStrategy`

Open

#20065 opened on July 9, 2024

(1 comment) (0 reactions) (0 assignees) Python (26,687 stars) (3,233 forks)
Labels: feature, help wanted, strategy: deepspeed

Description

Description & Motivation

After training a model for a while on a given number of GPUs, say 8, it is difficult to load the checkpoint onto 16 GPUs with the optimizer and model states unchanged. DeepSpeed has developed the universal checkpointing strategy to solve this problem, but I don't see this feature in pytorch-lightning.

Pitch

I would like pytorch-lightning to support this feature.

Alternatives

Try adding `universal_checkpoint` as a parameter of `DeepSpeedStrategy` and modifying the class accordingly, referring to https://www.deepspeed.ai/tutorials/universal-checkpointing/
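Until such a parameter exists, one possible workaround is to set the flag in the DeepSpeed config itself and pass that config to the strategy. The sketch below only builds the config dict. It assumes DeepSpeed's `checkpoint.load_universal` config key and the `config` argument of `DeepSpeedStrategy`; the helper name `with_universal_checkpoint` is hypothetical, and whether Lightning's checkpoint loading path actually honors the flag is untested here (that is what this issue asks for).

```python
import copy


def with_universal_checkpoint(ds_config: dict, enabled: bool = True) -> dict:
    """Return a copy of a DeepSpeed config dict with universal checkpoint loading toggled.

    Hypothetical helper for illustration; it only edits the config dict and
    leaves the input untouched.
    """
    new_config = copy.deepcopy(ds_config)
    # Assumption: DeepSpeed reads this flag from the "checkpoint" section.
    new_config.setdefault("checkpoint", {})["load_universal"] = enabled
    return new_config


# Example base config; the values are placeholders, not recommendations.
base_config = {
    "zero_optimization": {"stage": 2},
    "train_micro_batch_size_per_gpu": 1,
}
ds_config = with_universal_checkpoint(base_config)

# Possible usage (not run here; requires lightning and deepspeed installed):
# from lightning.pytorch.strategies import DeepSpeedStrategy
# strategy = DeepSpeedStrategy(config=ds_config)
```

The checkpoint saved on 8 GPUs would still need to be converted to the universal format first, as described in the tutorial linked above.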

Additional context

No response

cc @borda @awaelchli
