verl-project/verl

[FR] Documentation about Resolving OOM

Open

#1014 opened on Apr 10, 2025

Labels: call for contribution, documentation, good first issue

Description

Motivation

Many issues report out-of-memory (OOM) errors, e.g. #328. We need a clear guide on how to resolve them.

Plan

A non-exhaustive list of related configurations:

  1. Rollout: gpu_memory_utilization
  2. Other Inference:
    1. Liger Kernel
    2. *_max_len_per_gpu / micro_batch_size_per_gpu
  3. Training:
    1. Liger Kernel
    2. Ulysses Sequence Parallelism
    3. gradient checkpointing
    4. offload
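
To make the list above concrete, here is a minimal sketch of how these settings might be collected and rendered as Hydra-style `key=value` overrides. The helper `to_hydra_overrides` is hypothetical. The dotted key names, the `verl.trainer.main_ppo` entry point, and all values are assumptions for illustration, not recommendations; check them against the config files of the verl version you are running.

```python
# Hypothetical helper, not part of verl.
def to_hydra_overrides(overrides: dict) -> list[str]:
    """Render a flat dict as Hydra-style `key=value` command-line overrides."""
    args = []
    for key, value in overrides.items():
        # Emit booleans in lowercase (true/false), the usual YAML spelling.
        if isinstance(value, bool):
            value = str(value).lower()
        args.append(f"{key}={value}")
    return args


# ASSUMPTION: key names and values are illustrative; verify them against
# your verl version before use.
OOM_OVERRIDES = {
    # Rollout: fraction of GPU memory reserved for the inference engine
    "actor_rollout_ref.rollout.gpu_memory_utilization": 0.5,
    # Other inference: smaller micro-batches for log-prob computation
    "actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu": 4,
    # Training: smaller micro-batches per GPU
    "actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu": 2,
    # Liger Kernel
    "actor_rollout_ref.model.use_liger": True,
    # Ulysses Sequence Parallelism
    "actor_rollout_ref.actor.ulysses_sequence_parallel_size": 2,
    # Gradient checkpointing
    "actor_rollout_ref.model.enable_gradient_checkpointing": True,
    # Offload parameters and optimizer state to CPU
    "actor_rollout_ref.actor.fsdp_config.param_offload": True,
    "actor_rollout_ref.actor.fsdp_config.optimizer_offload": True,
}

if __name__ == "__main__":
    # ASSUMPTION: entry-point module name; adjust to your launch script.
    command = ["python3 -m verl.trainer.main_ppo"] + to_hydra_overrides(OOM_OVERRIDES)
    print(" \\\n    ".join(command))
```

Keeping the overrides in one dict makes it easy to toggle a single setting and see which one actually resolves the OOM.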

TODO

  • Complete the list of related configurations
  • Benchmark the effect & overhead of each configuration
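
One possible way to structure the benchmark is a one-factor-at-a-time plan: start from a baseline and change a single setting per run, so any difference in peak memory or step time can be attributed to that setting. The sketch below only builds the run plan. The function name and the setting names are placeholders, not verl config keys, and measuring memory and throughput is left to the actual training run.

```python
# Hypothetical planning helper; setting names are placeholders.
def one_at_a_time(baseline: dict, candidates: dict) -> list[tuple[str, dict]]:
    """Return (label, config) pairs: the baseline plus one variant per candidate."""
    runs = [("baseline", dict(baseline))]
    for key, value in candidates.items():
        variant = dict(baseline)  # copy so each run differs in exactly one setting
        variant[key] = value
        runs.append((f"{key}={value}", variant))
    return runs


if __name__ == "__main__":
    baseline = {"gradient_checkpointing": False, "offload": False, "micro_batch_size_per_gpu": 8}
    candidates = {"gradient_checkpointing": True, "offload": True, "micro_batch_size_per_gpu": 4}
    for label, config in one_at_a_time(baseline, candidates):
        print(label, config)
```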
