verl-project/verl

[FR] Documentation about Resolving OOM

Open

#1014 opened on April 10, 2025

Labels: call for contribution, documentation, good first issue

Description

Motivation

Many issues report out-of-memory (OOM) errors, e.g. #328. We need a clear guide on how to resolve OOM.

Plan

A non-exhaustive list of related configurations:

  1. Rollout: gpu_memory_utilization
  2. Other Inference:
    1. Liger Kernel
    2. *_max_len_per_gpu / micro_batch_size_per_gpu
  3. Training:
    1. Liger Kernel
    2. Ulysses Sequence Parallelism
    3. gradient checkpointing
    4. offload
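
To make the list above more concrete, here is a rough sketch of how these knobs might be collected and rendered as Hydra-style command-line overrides. The exact key paths (`actor_rollout_ref.…`), the example values, and the `to_overrides` helper are assumptions for illustration, not a confirmed reference; check them against the config files shipped with your verl version before use.

```python
# A minimal sketch, not an official recipe. Every key path and value below is
# an ASSUMPTION about verl's Hydra config layout; verify against your version.
OOM_KNOBS = {
    # Rollout: fraction of GPU memory the inference engine may reserve.
    "actor_rollout_ref.rollout.gpu_memory_utilization": 0.5,
    # Other inference (log-prob computation): smaller micro-batches per GPU.
    "actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu": 1,
    "actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu": 1,
    # Training: smaller micro-batches per GPU.
    "actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu": 1,
    # Training: Liger Kernel, Ulysses sequence parallelism,
    # gradient checkpointing, and offload.
    "actor_rollout_ref.model.use_liger": True,
    "actor_rollout_ref.actor.ulysses_sequence_parallel_size": 2,
    "actor_rollout_ref.model.enable_gradient_checkpointing": True,
    "actor_rollout_ref.actor.fsdp_config.param_offload": True,
    "actor_rollout_ref.actor.fsdp_config.optimizer_offload": True,
}


def to_overrides(knobs):
    """Render a flat dict as a list of Hydra-style `key=value` overrides."""
    return [f"{key}={value}" for key, value in knobs.items()]


if __name__ == "__main__":
    # Print a command line that can be copied into a launch script.
    # The entry point module name is also an assumption.
    parts = ["python3 -m verl.trainer.main_ppo"] + to_overrides(OOM_KNOBS)
    print(" \\\n    ".join(parts))
```

Keeping the knobs in one dict makes it easy to toggle them one at a time, which is also what benchmarking the effect and overhead of each setting would require.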

TODO

  • Complete the list of related configurations
  • Benchmark the effect & overhead of each configuration
