Additional memory optimization features · verl-project/verl#144

4 comments · 3 reactions · 0 assignees · Python · 21,533 stars · 3,940 forks

call for contribution, enhancement, good first issue

Description

Activation offloading (see implementation here)
Fusing optimizer step into backward pass (see implementation here)
Utilize full_shard reshard_after_forward (see here). I couldn't tell whether this is already implemented in veRL.

These optimizations largely trade additional compute for lower peak memory usage, so they may only be useful when training larger models or in GPU-constrained settings.
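
The trade-off is easiest to see with activation offloading. The sketch below uses PyTorch's built-in `torch.autograd.graph.save_on_cpu` context manager, which moves tensors saved for backward to host memory during the forward pass and copies them back when backward needs them. The wrapper name `forward_with_offload` and the toy model are hypothetical. This is a simple stand-in for the idea, not a reproduction of the linked torchtune implementation.

```python
import torch
from torch import nn
from torch.autograd.graph import save_on_cpu


def forward_with_offload(model, inputs):
    """Hypothetical wrapper: run the forward pass with saved tensors kept on the CPU."""
    # Pinned host memory speeds up GPU<->CPU copies but requires CUDA.
    pin = torch.cuda.is_available()
    with save_on_cpu(pin_memory=pin):
        return model(inputs)


device = "cuda" if torch.cuda.is_available() else "cpu"
torch.manual_seed(0)
model = nn.Sequential(nn.Linear(32, 64), nn.GELU(), nn.Linear(64, 1)).to(device)
x = torch.randn(16, 32, device=device)

# Backward through the offloaded graph: saved tensors are copied back on demand.
forward_with_offload(model, x).sum().backward()
offloaded_grads = [p.grad.clone() for p in model.parameters()]

# Baseline without offloading, for comparison.
model.zero_grad()
model(x).sum().backward()
baseline_grads = [p.grad.clone() for p in model.parameters()]

grads_match = all(
    torch.allclose(a, b) for a, b in zip(offloaded_grads, baseline_grads)
)
```

The gradients are the same either way; what changes is where the saved tensors live between forward and backward, at the cost of extra host-device copies. On a CPU-only machine the example still runs, but it saves no memory.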

Contributor guide

Tech stack: Python, PyTorch
Domain: machine learning, performance
Issue type: feature
Difficulty: 4
Estimated time: over 1 week
Activity status: fresh
Clarity: mostly clear
Prerequisites: Python, PyTorch, FSDP
Beginner friendliness: 15
Research direction: Explore the linked activation offloading and fused optimizer step implementations in torchtune (torchtune/training/_activation_offloading.py and torchtune/training/memory.py). Study the FSDP2 API differences described in torchtitan (docs/fsdp.md). Then survey the veRL codebase to identify where similar memory optimizations can be integrated, focusing on the training loop and model sharding. Consider asking the maintainers which optimization is needed first.