[FR] Documentation about Resolving OOM · verl-project/verl#1014

6 comments · 7 reactions · 0 assignees · Python · 21,533 stars · 3,940 forks

Labels: call for contribution, documentation, good first issue

Description

There are many issues related to OOM (out of memory), e.g. #328. We might need a clear guide on how to resolve OOM.

A non-exhaustive list of related configurations:

Rollout: gpu_memory_utilization
Other inference:
1. Liger Kernel
2. *_max_len_per_gpu / micro_batch_size_per_gpu
Training:
1. Liger Kernel
2. Ulysses Sequence Parallelism
3. gradient checkpointing
4. offload
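
As a starting point for such a guide, the knobs above could be collected as Hydra-style command-line overrides. The sketch below is illustrative only: the dotted key names are assumptions based on verl's example scripts and may differ between versions, the values are placeholders rather than recommendations, and `to_cli_overrides` is a hypothetical helper, not part of verl.

```python
# Assumed key names (verify against the config of your installed verl version).
# Values are placeholders for illustration, not tuned recommendations.
OOM_OVERRIDES = {
    # Rollout: fraction of GPU memory the inference engine may claim
    "actor_rollout_ref.rollout.gpu_memory_utilization": 0.5,
    # Other inference: smaller per-GPU batches for log-prob computation
    "actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu": 1,
    # Training: Liger Kernel, micro batch size, sequence parallelism,
    # gradient checkpointing, and offload
    "actor_rollout_ref.model.use_liger": True,
    "actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu": 1,
    "actor_rollout_ref.actor.ulysses_sequence_parallel_size": 2,
    "actor_rollout_ref.model.enable_gradient_checkpointing": True,
    "actor_rollout_ref.actor.fsdp_config.param_offload": True,
    "actor_rollout_ref.actor.fsdp_config.optimizer_offload": True,
}


def to_cli_overrides(overrides):
    """Turn a {dotted.key: value} dict into `key=value` strings (hypothetical helper)."""
    args = []
    for key, value in overrides.items():
        if isinstance(value, bool):
            value = str(value).lower()  # write booleans as true/false
        args.append(f"{key}={value}")
    return args


if __name__ == "__main__":
    # Print one override per line so they can be pasted into a launch script.
    print(" \\\n    ".join(to_cli_overrides(OOM_OVERRIDES)))
```

Keeping the overrides in one dictionary makes it easy to toggle a single setting at a time when checking which one actually relieves the OOM.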

Tech stack: Python, PyTorch
Domain: documentation, performance
Issue type: documentation
Difficulty: 3
Estimated time: 1-2 days
Activity status: active
Clarity: clear
Prerequisites: basic Python, familiarity with PyTorch, understanding of GPU memory
Newbie friendliness: 70
Research direction: Investigate existing OOM-related issues, e.g. #328, to identify common scenarios. For each configuration listed (rollout gpu_memory_utilization, Liger Kernel, *_max_len_per_gpu / micro_batch_size_per_gpu, Ulysses Sequence Parallelism, gradient checkpointing, offload), benchmark its effect on memory usage and training performance. Document the findings in a structured guide, referencing relevant source files such as configuration scripts and training loops.
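
The benchmarking step could start from a small helper that records peak GPU memory around one training or inference step. This is a minimal sketch, not part of verl: `measure_peak_mib` and `compare_settings` are hypothetical names, the step functions are supplied by the caller, and the helper only reports memory allocated through PyTorch's CUDA allocator on the current device (memory claimed by a separate inference engine process would not show up here).

```python
import importlib.util


def measure_peak_mib(step_fn):
    """Run step_fn once; return peak CUDA memory in MiB, or None without CUDA."""
    if importlib.util.find_spec("torch") is None:
        return None  # PyTorch is not installed
    import torch

    if not torch.cuda.is_available():
        return None  # nothing to measure on a CPU-only machine
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()  # start a fresh peak counter
    step_fn()
    torch.cuda.synchronize()  # make sure all kernels have finished
    return torch.cuda.max_memory_allocated() / 2**20


def compare_settings(step_fns):
    """Measure each named step function, e.g. {'baseline': f, 'grad_ckpt': g}."""
    return {name: measure_peak_mib(fn) for name, fn in step_fns.items()}
```

Running the same step with one configuration changed at a time (for example gradient checkpointing on and off) would give the per-setting numbers the guide needs, alongside a wall-clock measurement for the performance cost.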