[Bug] Forced coupling between num_generations and per_device_train_batch_size in GRPOTrainer resulting in OOM · unslothai/unsloth#3572


help wanted

Description

Did you update? pip install --upgrade unsloth unsloth_zoo. Yes
Colab or Kaggle or local / cloud. Cloud
Number GPUs used, use nvidia-smi. One GPU.
Which notebook? Please link! https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Qwen3_(4B)-GRPO.ipynb
Which Unsloth version, TRL version, transformers version, PyTorch version? Unsloth: 2025.11.2 TRL: 0.22.2 Transformers: 4.56.2 PyTorch: 2.8.0+cu128
Which trainer? SFTTrainer, GRPOTrainer etc. GRPOTrainer

When setting per_device_train_batch_size different from num_generations in GRPOConfig, a warning appears:

Unsloth: We now expect `per_device_train_batch_size` to be a multiple of `num_generations`.
We will change the batch size of 1 to the `num_generations` of 32.

However, num_generations is a critical parameter for GRPO convergence, and in your demo notebooks it is typically set to a small value. When the trainer automatically raises per_device_train_batch_size to match a large num_generations, training fails with out-of-memory (OOM) errors.

In other words, large num_generations values are necessary for stable training, but the enforced coupling makes GRPOTrainer practically unusable.
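To make the coupling concrete, here is a minimal sketch of the adjustment as the warning describes it. It is not Unsloth's actual code: the helper name `coerce_batch_size` is hypothetical, and the rule (replace the batch size with num_generations when it is not a multiple) is inferred only from the warning text above.

```python
def coerce_batch_size(per_device_train_batch_size: int, num_generations: int) -> int:
    """Hypothetical helper mirroring the behavior the warning describes.

    Assumption: a batch size that is not a multiple of num_generations
    is replaced by num_generations itself.
    """
    if per_device_train_batch_size % num_generations != 0:
        return num_generations
    return per_device_train_batch_size


requested = 1
effective = coerce_batch_size(requested, num_generations=32)

# The per-device batch grows from the requested value to num_generations.
print(f"requested={requested}, effective={effective}")
```

With the values from this report, a requested batch size of 1 becomes 32 sequences per device per forward/backward pass, which is where the memory pressure comes from.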

I’d like to understand the correct way to use a large num_generations value without running into out-of-memory (OOM) issues.

Note: Related to unslothai/unsloth#3149. In that closed issue, @mmathew23 commented:

“But if it does decrease num_generations to 6 and increase gradient_accumulation_steps to 4, you’ll still get the 12 generations per prompt per optimizer step.”

I don’t quite understand how this results in 12 generations per ONE prompt: 6 and 4 do not combine to 12 by either multiplication or division.
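For reference, the bookkeeping in question can be sketched as follows. This assumes the accounting I understand upstream TRL to use: each optimizer step consumes per_device_train_batch_size × gradient_accumulation_steps × num_processes completions, split into groups of num_generations per prompt. Whether Unsloth's patched trainer counts the same way is part of what I am asking. Both helper names are hypothetical, and the per-device batch size of 6 is my assumption (the value the warning logic would pick for num_generations=6).

```python
def completions_per_optimizer_step(
    per_device_train_batch_size: int,
    gradient_accumulation_steps: int,
    num_processes: int = 1,
) -> int:
    """Total completions consumed by one optimizer step (assumed accounting)."""
    return per_device_train_batch_size * gradient_accumulation_steps * num_processes


def prompts_per_optimizer_step(
    per_device_train_batch_size: int,
    gradient_accumulation_steps: int,
    num_generations: int,
    num_processes: int = 1,
) -> int:
    """Unique prompts per optimizer step, given num_generations completions each."""
    total = completions_per_optimizer_step(
        per_device_train_batch_size, gradient_accumulation_steps, num_processes
    )
    if total % num_generations != 0:
        raise ValueError("total completions must be divisible by num_generations")
    return total // num_generations


# Values from the quote, with an assumed per-device batch size of 6.
total = completions_per_optimizer_step(6, 4)
prompts = prompts_per_optimizer_step(6, 4, num_generations=6)
print(f"completions={total}, prompts={prompts}, per_prompt={total // prompts}")
```

Under these assumptions the step sees 24 completions across 4 prompts, 6 per prompt, so I still cannot reproduce the figure of 12 from the quote.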

Contributor guide

Tech stack: python, pytorch
Domain: machine learning
Issue type: bug
Difficulty: 3
Estimated time: 1-2 days
Activity status: active
Clarity: clear
Prerequisites: Python, PyTorch, GRPO training, Hugging Face Transformers
Beginner friendliness: 45
Research direction: Examine the GRPOTrainer code in the unsloth repository to understand how per_device_train_batch_size and num_generations are coupled (likely in the training loop or configuration handling). Look at issue #3149 for related discussion. The goal is to propose a fix that decouples the two or provides a proper workaround to avoid OOM when num_generations is large. Consider changing the logic so that it does not automatically change the batch size, or giving users a configurable option.