It seems that the "save_checkpoint" method is not implemented.
Contributor guide
Tech stack
python, pytorch
Domain
backend, machine learning
Issue type
feature
Difficulty: estimated implementation difficulty for a new contributor, where 1 is a very small change and 5 is expert-level work.
3
Estimated time: rough time range for an experienced contributor to investigate, implement, test, and prepare a pull request.
1-3 hours
Activity status: how open the issue currently is to participation (fresh, active, stale, blocked, or waiting on maintainer input).
active
Clarity: whether the issue clearly states the expected change, the acceptance criteria, and the next steps.
clear
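Prerequisites
Python, PyTorch, Megatron (distributed training)
Newcomer friendliness: estimated score from 1 to 100 indicating how approachable the issue is for first-time contributors.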
55
Research direction
Investigate the missing `save_checkpoint` method in `verl/workers/megatron_workers.py` at line 428. Examine existing checkpoint implementations in other VERL workers (e.g., `fsdp_workers.py`) and Megatron-LM's checkpointing utilities. Implement a method that saves model and optimizer states in a format compatible with `load_checkpoint`. Review the 4 comments on the issue for any additional context or proposed approaches.
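The save/load symmetry described above can be sketched with plain PyTorch. This is a minimal single-process illustration, not the VERL or Megatron-LM implementation: the function signatures, the `global_step_{step}/rank_{rank}.pt` layout, and the dictionary keys are all hypothetical, and a real Megatron worker would also have to handle tensor- and pipeline-parallel shards. Only `torch.save`, `torch.load`, `state_dict`, and `load_state_dict` are actual PyTorch calls.

```python
import os
import tempfile

import torch
from torch import nn


def checkpoint_path(ckpt_dir: str, step: int, rank: int) -> str:
    # Hypothetical layout: one file per rank under a per-step directory.
    return os.path.join(ckpt_dir, f"global_step_{step}", f"rank_{rank}.pt")


def save_checkpoint(model, optimizer, ckpt_dir: str, step: int, rank: int = 0) -> str:
    """Write model and optimizer state for this rank; returns the file path."""
    path = checkpoint_path(ckpt_dir, step, rank)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    state = {
        "step": step,
        "model": model.state_dict(),
        "optimizer": optimizer.state_dict(),
    }
    torch.save(state, path)
    return path


def load_checkpoint(model, optimizer, ckpt_dir: str, step: int, rank: int = 0) -> int:
    """Restore state written by save_checkpoint; returns the saved step."""
    path = checkpoint_path(ckpt_dir, step, rank)
    state = torch.load(path, map_location="cpu")
    # Keys must match exactly what save_checkpoint wrote.
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    return state["step"]


if __name__ == "__main__":
    model = nn.Linear(4, 2)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
    with tempfile.TemporaryDirectory() as ckpt_dir:
        print(save_checkpoint(model, optimizer, ckpt_dir, step=1))
        print(load_checkpoint(model, optimizer, ckpt_dir, step=1))
```

Keeping the path helper and the dictionary keys shared between the two functions is what makes the saved format compatible with the loader; a round-trip test like the one in the `__main__` block is a cheap way to check that before wiring the method into a distributed worker.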