Repository metrics

Stars: 64,271
PR merge metrics: average merge time 3d 15h; 525 merged PRs in 30d

Description

Here is the full log (I also added memory debugging manually). I cannot get GRPO to work: the docs say it should fit in 140GB, and I have 183GB. It seems to be mostly the KV cache, but that seems insane since it's already at only 4k context length.
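The training script itself is not shown, so the following is only a sketch of how `[VRAM]` lines like the ones in the log could be produced. The helper names `format_vram` and `log_vram` are hypothetical. The "free" figure is computed as total minus reserved, which is an inference from the logged numbers (for example 178.35 - 56.88 = 121.47), not something confirmed by the author.

```python
GIB = 1024 ** 3  # the log's "GB" values behave like GiB (bytes / 2**30)


def format_vram(label, alloc_bytes, reserved_bytes, total_bytes):
    """Format one [VRAM] line. 'free' is total minus reserved (assumed)."""
    free_bytes = total_bytes - reserved_bytes
    return (
        f"[VRAM] {label}: {alloc_bytes / GIB:.2f} GB alloc, "
        f"{reserved_bytes / GIB:.2f} GB reserved, {free_bytes / GIB:.2f} GB free"
    )


def log_vram(label, device=0):
    """Query PyTorch's caching allocator and print one line (needs a CUDA GPU)."""
    import torch  # imported lazily so format_vram can be used without torch

    line = format_vram(
        label,
        torch.cuda.memory_allocated(device),   # bytes held by live tensors
        torch.cuda.memory_reserved(device),    # bytes held by the allocator
        torch.cuda.get_device_properties(device).total_memory,
    )
    print(line)
    return line
```

Calling `log_vram("After base model load")` right after each setup stage would give a trace comparable to the one below. Note that `memory_allocated` only counts memory owned by PyTorch tensors, so it can undercount what the process uses overall.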

(uni_grpo) root@gorgeous-chicken-of-unity:~/uni_grpo# python3 train.py
🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
[WARNING] HF_TOKEN not set. Uploads skipped.
[GPU] NVIDIA B200
[GPU] Total VRAM: 178.35 GB

[VRAM] Initial: 0.00 GB alloc, 0.00 GB reserved, 178.35 GB free
[model] Loading unsloth/gpt-oss-120b, max_seq_length=4608
==((====))== Unsloth 2025.10.1: Fast Gpt_Oss patching. Transformers: 4.56.2.
\ /| NVIDIA B200. Num GPUs = 1. Max memory: 178.351 GB. Platform: Linux.
O^O/ _/ \ Torch: 2.8.0+cu128. CUDA: 10.0. CUDA Toolkit: 12.8. Triton: 3.4.0
\ / Bfloat16 = TRUE. FA [Xformers = 0.0.33+5146f2a.d20251005. FA2 = False]
"-____-" Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████| 16/16 [00:29<00:00, 1.85s/it]
Unsloth: Offloading embeddings to RAM to save 1.08 GB.
[VRAM] After base model load: 56.85 GB alloc, 56.88 GB reserved, 121.47 GB free

[generation_config] Creating explicit generation config
[generation_config] Setting max_length=4608
[generation_config] Setting max_new_tokens=512
[generation_config] Verified model.generation_config.max_length = 4608
[generation_config] Verified model.generation_config.max_new_tokens = 512

[VRAM] After for_inference: 56.85 GB alloc, 56.88 GB reserved, 121.47 GB free
[lora] Loading adapter from grpo-adapter-step-20
[VRAM] After LoRA adapter load: 56.94 GB alloc, 57.06 GB reserved, 121.29 GB free
[lora] Trainable: 23,887,872 / 59,044,394,304 (0.0405%)

[VRAM] After gradient checkpointing: 56.94 GB alloc, 57.06 GB reserved, 121.29 GB free
[VRAM] After GC: 56.94 GB alloc, 56.97 GB reserved, 121.38 GB free
[VRAM] After dataset load: 56.94 GB alloc, 56.97 GB reserved, 121.38 GB free - 17200 samples

[config] per_device_batch_size: 1
[config] gradient_accumulation: 4
[config] num_generations: 2
[config] Effective batch: 4
[config] Completions in memory: 2

Unsloth: We now expect per_device_train_batch_size to be a multiple of num_generations. We will change the batch size of 1 to the num_generations of 2
[VRAM] After trainer init: 56.94 GB alloc, 56.97 GB reserved, 121.38 GB free

[generation_config] Final check before training:

model.generation_config.max_length: 4608
model.generation_config.max_new_tokens: 512

[VRAM] Before training (post-GC): 56.94 GB alloc, 56.97 GB reserved, 121.38 GB free

================================================================================
STARTING TRAINING

The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'bos_token_id': 199998}.
==((====))== Unsloth - 2x faster free finetuning | Num GPUs used = 1
\ /| Num examples = 17,200 | Num Epochs = 1 | Total steps = 4,300
O^O/ _/ \ Batch size per device = 2 | Gradient accumulation steps = 4
\ / Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
"-____-" Trainable parameters = 23,887,872 of 116,853,044,544 (0.02% trained)
0%| | 0/4300 [00:00<?, ?it/s]

ENTERING FIRST TRAINING STEP - DETAILED MEMORY TRACKING

[VRAM] Step 0: Before generation starts: 58.03 GB alloc, 58.07 GB reserved, 120.28 GB free

[generation_config] max_length: 4608 [generation_config] max_new_tokens: 512

[MEMORY CALC] Generation parameters:

Batch size: 2
Num generations: 2
Total sequences: 4
Max length per sequence: 4608
Total tokens to generate: 18432

[MEMORY CALC] Estimated KV cache: 67.50 GB
Formula: 2 × 80 layers × 96 heads × 128 dim × 4608 tokens × 4 seqs × 2 bytes

[MEMORY CALC] Estimated swiglu activations: 2.25 GB
Formula: 4 seqs × 4608 tokens × 32768 intermediate × 2 × 2 bytes

[MEMORY CALC] Total estimated for generation: 69.75 GB
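The arithmetic behind the `[MEMORY CALC]` lines can be reproduced with a few lines of Python. This is only a sketch of the log's own formulas: the layer, head, and intermediate-size values are the ones printed in the log and are not read from the model, so they may not match the actual `unsloth/gpt-oss-120b` config. The function names are hypothetical, and reading the swiglu formula's extra `× 2` as "gate and up projections" is an assumption.

```python
GIB = 1024 ** 3  # the log's "GB" figures only match when dividing by 2**30


def kv_cache_bytes(layers, heads, head_dim, tokens, seqs, bytes_per_elem=2):
    """KV cache size: leading 2 is one key plus one value tensor per layer."""
    return 2 * layers * heads * head_dim * tokens * seqs * bytes_per_elem


def swiglu_bytes(seqs, tokens, intermediate, bytes_per_elem=2):
    """Swiglu activations: the factor 2 is assumed to cover gate and up."""
    return seqs * tokens * intermediate * 2 * bytes_per_elem


# Values copied from the log's formulas (not from model.config).
kv = kv_cache_bytes(layers=80, heads=96, head_dim=128, tokens=4608, seqs=4)
act = swiglu_bytes(seqs=4, tokens=4608, intermediate=32768)

print(f"KV cache: {kv / GIB:.2f} GiB")           # KV cache: 67.50 GiB
print(f"swiglu:   {act / GIB:.2f} GiB")          # swiglu:   2.25 GiB
print(f"total:    {(kv + act) / GIB:.2f} GiB")   # total:    69.75 GiB
```

Passing the layer count, KV head count, and head dimension from the loaded model's config instead of the hard-coded values would show whether the 67.50 GB estimate is realistic for this model.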

generation_config default values have been modified to match model-specific defaults: {'max_length': 4608}. If this is not desired, please set these values explicitly.
Traceback (most recent call last):
  File "/root/uni_grpo/train.py", line 628, in <module>
    main()
  File "/root/uni_grpo/train.py", line 613, in main
    train_result = trainer.train()
  File "/root/uni_grpo/unsloth_compiled_cache/UnslothGRPOTrainer.py", line 53, in wrapper
    output = f(self, *args, **kwargs)
  File "/root/uni_grpo/.venv/lib/python3.10/site-packages/transformers/trainer.py", line 2328, in train
    return inner_training_loop(
  File "", line 323, in _fast_inner_training_loop
  File "", line 34, in _unsloth_training_step
  File "/root/uni_grpo/.venv/lib/python3.10/site-packages/trl/extras/profiling.py", line 98, in wrapper
    return func(self, *args, **kwargs)
  File "/root/uni_grpo/unsloth_compiled_cache/UnslothGRPOTrainer.py", line 2015, in _prepare_inputs
    generation_batch = self._generate_and_score_completions(generation_batch)
  File "/root/uni_grpo/unsloth_compiled_cache/UnslothGRPOTrainer.py", line 2323, in _generate_and_score_completions
    prompt_completion_ids = unwrapped_model.generate(
  File "/root/uni_grpo/.venv/lib/python3.10/site-packages/unsloth/models/rl.py", line 71, in generate_with_clone
    out = original_generate(*args, **kwargs)
  File "/root/uni_grpo/.venv/lib/python3.10/site-packages/peft/peft_model.py", line 1973, in generate
    outputs = self.base_model.generate(*args, **kwargs)
  File "/root/uni_grpo/.venv/lib/python3.10/site-packages/unsloth/models/vision.py", line 266, in unsloth_base_fast_generate
    output = self._old_generate(*args, **kwargs)
  File "/root/uni_grpo/.venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
    return func(*args, **kwargs)
  File "/root/uni_grpo/.venv/lib/python3.10/site-packages/transformers/generation/utils.py", line 2539, in generate
    result = self._sample(
  File "/root/uni_grpo/.venv/lib/python3.10/site-packages/transformers/generation/utils.py", line 2867, in _sample
    outputs = self(**model_inputs, return_dict=True)
  File "/root/uni_grpo/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/root/uni_grpo/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1784, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/uni_grpo/unsloth_compiled_cache/unsloth_compiled_module_gpt_oss.py", line 721, in forward
    return GptOssForCausalLM_forward(self, input_ids, attention_mask, position_ids, past_key_values, inputs_embeds, labels, use_cache, output_router_logits, cache_position, logits_to_keep, **kwargs)
  File "/root/uni_grpo/.venv/lib/python3.10/site-packages/torch/_dynamo/external_utils.py", line 198, in nonrecursive_disable_wrapper
    return fn(*args, **kwargs)
  File "/root/uni_grpo/.venv/lib/python3.10/site-packages/transformers/utils/generic.py", line 940, in wrapper
    output = func(self, *args, **kwargs)
  File "/root/uni_grpo/unsloth_compiled_cache/unsloth_compiled_module_gpt_oss.py", line 542, in GptOssForCausalLM_forward
    outputs: MoeModelOutputWithPast = self.model(
  File "/root/uni_grpo/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/root/uni_grpo/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1784, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/uni_grpo/.venv/lib/python3.10/site-packages/unsloth_zoo/temporary_patches/gpt_oss.py", line 1244, in forward
    hidden_states = decoder_layer(
  File "/root/uni_grpo/.venv/lib/python3.10/site-packages/transformers/modeling_layers.py", line 94, in __call__
    return super().__call__(*args, **kwargs)
  File "/root/uni_grpo/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/root/uni_grpo/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1784, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/uni_grpo/.venv/lib/python3.10/site-packages/transformers/utils/deprecation.py", line 172, in wrapped_func
    return func(*args, **kwargs)
  File "/root/uni_grpo/.venv/lib/python3.10/site-packages/transformers/models/gpt_oss/modeling_gpt_oss.py", line 381, in forward
    hidden_states, _ = self.mlp(hidden_states) # diff with llama: router scores
  File "/root/uni_grpo/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/root/uni_grpo/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1784, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/uni_grpo/.venv/lib/python3.10/site-packages/unsloth_zoo/temporary_patches/gpt_oss.py", line 643, in forward
    routed_out = self.experts(hidden_states, router_indices=router_indices, routing_weights=router_scores)
  File "/root/uni_grpo/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/root/uni_grpo/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1784, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/uni_grpo/.venv/lib/python3.10/site-packages/unsloth_zoo/temporary_patches/gpt_oss.py", line 503, in forward
    fused = swiglu_torch_forward(gate_up, self.alpha, self.limit, dtype = X_rep.dtype)
  File "/root/uni_grpo/.venv/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 736, in compile_wrapper
    return fn(*args, **kwargs)
  File "/root/uni_grpo/.venv/lib/python3.10/site-packages/unsloth_zoo/temporary_patches/gpt_oss.py", line 55, in swiglu_torch_forward
    a_linear = a_linear.clamp(min=-limit, max=limit)
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 27.50 GiB. GPU 0 has a total capacity of 178.35 GiB of which 9.29 GiB is free. Including non-PyTorch memory, this process has 169.05 GiB memory in use. Of the allocated memory 168.16 GiB is allocated by PyTorch, and 69.66 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
0%| | 0/4300 [00:02<?, ?it/s]
(uni_grpo) root@gorgeous-chicken-of-unity:~/uni_grpo#
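The numbers in the `torch.OutOfMemoryError` message above are easier to reason about once extracted. The sketch below parses that message with regular expressions. The function name `parse_oom` and the dictionary keys are hypothetical, and the patterns are written against the wording of this particular message, which may differ in other PyTorch versions.

```python
import re

# Message text copied from the traceback above.
OOM_MESSAGE = (
    "CUDA out of memory. Tried to allocate 27.50 GiB. GPU 0 has a total "
    "capacity of 178.35 GiB of which 9.29 GiB is free. Including non-PyTorch "
    "memory, this process has 169.05 GiB memory in use. Of the allocated "
    "memory 168.16 GiB is allocated by PyTorch, and 69.66 MiB is reserved by "
    "PyTorch but unallocated."
)

TO_GIB = {"MiB": 1 / 1024, "GiB": 1.0}
SIZE = r"([\d.]+) (MiB|GiB)"
PATTERNS = {
    "requested": rf"Tried to allocate {SIZE}",
    "total": rf"total capacity of {SIZE}",
    "free": rf"of which {SIZE} is free",
    "allocated": rf"{SIZE} is allocated by PyTorch",
    "reserved_unallocated": rf"{SIZE} is reserved by PyTorch but unallocated",
}


def parse_oom(message):
    """Return the sizes found in an OOM message, all converted to GiB."""
    sizes = {}
    for key, pattern in PATTERNS.items():
        match = re.search(pattern, message)
        if match:
            value, unit = match.groups()
            sizes[key] = float(value) * TO_GIB[unit]
    return sizes


info = parse_oom(OOM_MESSAGE)
shortfall = info["requested"] - info["free"]
print(f"shortfall: {shortfall:.2f} GiB")
print(f"reserved but unallocated: {info['reserved_unallocated']:.3f} GiB")
```

For this message the single failing request exceeds the free memory by roughly 18 GiB, while less than 0.1 GiB is reserved but unallocated. That suggests fragmentation is probably not the main cause here, although this reading is an interpretation of the message rather than a confirmed diagnosis.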

oss_120b_grpo.py.py

Contributor guide

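Research direction: Inspect the OOM in the swiglu activation allocation (the 27.50 GiB request in `swiglu_torch_forward`). Try reducing `max_seq_length` to 2048. `num_generations` is already 2 in this run, and GRPO needs more than one completion per prompt to compute group-relative advantages, so lowering it further is unlikely to be a usable fix. Check whether Flash Attention 2 can be enabled (the startup banner reports `FA2 = False`), and try setting the environment variable `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True`. Also consider using activation offloading or reducing the size of the swiglu intermediate activations.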
Tech stack: Python, PyTorch
Domain: machine learning, AI
Issue type: Bug
Difficulty: 3
Estimated time: Half day
Activity status: Active
Clarity: Clear
Prerequisites: Python, PyTorch, CUDA
Newbie friendliness: 40
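
The `PYTORCH_CUDA_ALLOC_CONF` suggestion from the research direction can be applied from inside the training script as well as from the shell. The sketch below assumes the variable has to be in place before PyTorch initialises its CUDA allocator, so it is set before `torch` is imported. The helper name `configure_allocator` is hypothetical.

```python
import os


def configure_allocator(env=None):
    """Set expandable segments unless the user already chose a config."""
    env = os.environ if env is None else env
    env.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")
    return env["PYTORCH_CUDA_ALLOC_CONF"]


# Call this at the very top of train.py, before any `import torch`.
active = configure_allocator()
print(f"PYTORCH_CUDA_ALLOC_CONF={active}")

# import torch  # only import torch (and unsloth) after this point
```

The shell equivalent is `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True python3 train.py`. Using `setdefault` keeps any value that was already exported, so the script does not silently override a deliberate setting.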