verl-project/verl

[FR] Documentation about Resolving OOM

Open

#1014, created on April 10, 2025

Labels: call for contribution, documentation, good first issue

Description

Motivation

There are many issues related to OOM (out of memory), e.g. #328. We might need a clear guide on how to resolve OOM.

Plan

A non-exhaustive list of related configurations:

  1. Rollout: gpu_memory_utilization
  2. Other Inference:
    1. Liger Kernel
    2. *_max_len_per_gpu / micro_batch_size_per_gpu
  3. Training:
    1. Liger Kernel
    2. Ulysses Sequence Parallelism
    3. gradient checkpointing
    4. offload
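
As a rough illustration of how the knobs above might be combined, here is a minimal Python sketch that collects them as Hydra-style overrides and renders them as command-line arguments. The key names are assumptions based on how they appear in verl's example scripts and may differ between versions; the values are placeholders, not recommendations. `OOM_OVERRIDES` and `to_hydra_args` are hypothetical names introduced only for this example.

```python
# Assumed key names (verify against the verl version in use); values are placeholders.
OOM_OVERRIDES = {
    # 1. Rollout: fraction of GPU memory the inference engine may claim.
    "actor_rollout_ref.rollout.gpu_memory_utilization": 0.4,
    # 2. Other inference: smaller micro-batches for log-prob passes.
    "actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu": 2,
    "actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu": 2,
    # 3. Training: Liger Kernel, Ulysses sequence parallelism,
    #    gradient checkpointing, and offload.
    "actor_rollout_ref.model.use_liger": True,
    "actor_rollout_ref.actor.ulysses_sequence_parallel_size": 2,
    "actor_rollout_ref.model.enable_gradient_checkpointing": True,
    "actor_rollout_ref.actor.fsdp_config.param_offload": True,
    "actor_rollout_ref.actor.fsdp_config.optimizer_offload": True,
    "actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu": 2,
}


def to_hydra_args(overrides):
    """Render a dict of overrides as a list of `key=value` strings."""
    return [f"{key}={value}" for key, value in overrides.items()]


if __name__ == "__main__":
    # Print one override per line, ready to append to a launch command.
    print(" \\\n    ".join(to_hydra_args(OOM_OVERRIDES)))
```

Keeping the overrides in one place like this could make it easier to toggle a single setting at a time when measuring its effect on memory and throughput.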

TODO

  • Complete the list of related configurations
  • Benchmark the effect & overhead of each configuration
