verl-project/verl

[FR] Documentation about Resolving OOM

Open

#1014 opened on April 10, 2025

6 comments · 7 reactions · 0 assignees · Python · 21,533 stars · 3,940 forks
Labels: call for contribution, documentation, good first issue

Description

Motivation

Many issues report out-of-memory (OOM) errors, e.g. #328. We need a clear guide on how to resolve OOM.

Plan

A non-exhaustive list of related configurations:

  1. Rollout: gpu_memory_utilization
  2. Other Inference:
    1. Liger Kernel
    2. *_max_len_per_gpu / micro_batch_size_per_gpu
  3. Training:
    1. Liger Kernel
    2. Ulysses Sequence Parallelism
    3. gradient checkpointing
    4. offload
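
As a rough illustration of how the knobs above might be collected in one place, the sketch below renders a set of OOM-related settings as Hydra-style `key=value` overrides. The helper `to_overrides` and the dict `OOM_SETTINGS` are hypothetical, the dotted key names are assumptions that should be checked against the config of the verl version in use, and the values are placeholders rather than recommendations.

```python
# Hypothetical helper: render OOM-related settings as Hydra-style overrides.
# Key names are assumptions; verify them against your verl version's config.
# Values are illustrative placeholders, not tuned recommendations.

OOM_SETTINGS = {
    # 1. Rollout: fraction of GPU memory the inference engine may claim.
    "actor_rollout_ref.rollout.gpu_memory_utilization": 0.5,
    # 2. Other inference: smaller micro-batches for log-prob computation.
    "actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu": 4,
    # 3. Training: Liger Kernel, Ulysses sequence parallelism,
    #    gradient checkpointing, and offload.
    "actor_rollout_ref.model.use_liger": True,
    "actor_rollout_ref.actor.ulysses_sequence_parallel_size": 2,
    "actor_rollout_ref.model.enable_gradient_checkpointing": True,
    "actor_rollout_ref.actor.fsdp_config.param_offload": True,
    "actor_rollout_ref.actor.fsdp_config.optimizer_offload": True,
    "actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu": 2,
}


def to_overrides(settings):
    """Render a flat {dotted.key: value} mapping as 'key=value' strings."""
    return [f"{key}={value}" for key, value in settings.items()]


if __name__ == "__main__":
    # Print one override per line, ready to append to a launch command.
    print(" \\\n    ".join(to_overrides(OOM_SETTINGS)))
```

Keeping the settings in a single mapping makes it easier to toggle one option at a time, which is what the benchmarking item in the TODO would need.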

TODO

  • Complete the list of related configurations
  • Benchmark the effect & overhead of each configuration
