It seems that the "save_checkpoint" method is not implemented.
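Contributor guide
Tech stack
python, pytorch
Domain
backend, machine learning
Issue type
feature
Difficulty (estimated implementation difficulty for a new contributor; 1 means a very small change, 5 means expert-level work)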
3
Estimated time (rough time range for an experienced contributor to investigate, implement, test, and prepare a pull request)
1-3 hours
Activity status (how open the issue currently is to participation: fresh, active, stale, blocked, or awaiting maintainer input)
active
Clarity (whether the issue clearly states the expected change, acceptance criteria, and next steps)
clear
Prerequisites
Python, PyTorch, Megatron (distributed training)
Newcomer friendliness (an estimated score from 1 to 100 indicating how friendly the issue is to first-time contributors)
55
Research direction
Investigate the missing `save_checkpoint` method in `verl/workers/megatron_workers.py` at line 428. Examine existing checkpoint implementations in other VERL workers (e.g., `fsdp_workers.py`) and Megatron-LM's checkpointing utilities. Implement a method that saves model and optimizer states in a format compatible with `load_checkpoint`. Review the 4 comments on the issue for any additional context or proposed approaches.
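As a starting point for the investigation, the save/load contract can be sketched in plain PyTorch. This is not VERL's or Megatron-LM's actual API: the function signatures, the `rank` argument, and the `rank_{rank}.pt` file layout are all hypothetical, chosen only to show that whatever `save_checkpoint` writes must be readable by the matching `load_checkpoint`. A real Megatron worker would also have to account for tensor- and pipeline-parallel sharding, which this sketch ignores.

```python
import os
import tempfile

import torch
from torch import nn


def save_checkpoint(model, optimizer, ckpt_dir, rank=0):
    """Write one rank's model and optimizer state (hypothetical layout)."""
    os.makedirs(ckpt_dir, exist_ok=True)
    path = os.path.join(ckpt_dir, f"rank_{rank}.pt")  # assumed naming scheme
    torch.save(
        {"model": model.state_dict(), "optimizer": optimizer.state_dict()},
        path,
    )
    return path


def load_checkpoint(model, optimizer, ckpt_dir, rank=0):
    """Restore the state written by save_checkpoint for the same rank."""
    path = os.path.join(ckpt_dir, f"rank_{rank}.pt")
    state = torch.load(path, map_location="cpu")
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])


if __name__ == "__main__":
    model = nn.Linear(4, 2)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

    # Take one step so the optimizer has non-trivial state to save.
    model(torch.randn(3, 4)).sum().backward()
    optimizer.step()

    with tempfile.TemporaryDirectory() as ckpt_dir:
        save_checkpoint(model, optimizer, ckpt_dir)

        # A fresh model/optimizer pair stands in for a restarted worker.
        restored = nn.Linear(4, 2)
        restored_opt = torch.optim.Adam(restored.parameters(), lr=1e-3)
        load_checkpoint(restored, restored_opt, ckpt_dir)

    print(torch.equal(model.weight, restored.weight))
```

A round-trip check like the one in the `__main__` block is also a reasonable shape for a unit test accompanying the pull request: save, reload into fresh objects, and compare parameters and optimizer state.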