Description
Here is the full log (I also added manual memory debugging output). I cannot get GRPO to work: the docs say it should fit in 140 GB, and I have 183 GB. The memory seems to go mostly to the KV cache, but that seems excessive given that the context length is already only about 4k tokens.
(uni_grpo) root@gorgeous-chicken-of-unity:~/uni_grpo# python3 train.py
🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
[WARNING] HF_TOKEN not set. Uploads skipped.
[GPU] NVIDIA B200
[GPU] Total VRAM: 178.35 GB
[VRAM] Initial: 0.00 GB alloc, 0.00 GB reserved, 178.35 GB free
[model] Loading unsloth/gpt-oss-120b, max_seq_length=4608
==((====))== Unsloth 2025.10.1: Fast Gpt_Oss patching. Transformers: 4.56.2.
\ /| NVIDIA B200. Num GPUs = 1. Max memory: 178.351 GB. Platform: Linux.
O^O/ _/ \ Torch: 2.8.0+cu128. CUDA: 10.0. CUDA Toolkit: 12.8. Triton: 3.4.0
\ / Bfloat16 = TRUE. FA [Xformers = 0.0.33+5146f2a.d20251005. FA2 = False]
"-____-" Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████| 16/16 [00:29<00:00, 1.85s/it]
Unsloth: Offloading embeddings to RAM to save 1.08 GB.
[VRAM] After base model load: 56.85 GB alloc, 56.88 GB reserved, 121.47 GB free
[generation_config] Creating explicit generation config
[generation_config] Setting max_length=4608
[generation_config] Setting max_new_tokens=512
[generation_config] Verified model.generation_config.max_length = 4608
[generation_config] Verified model.generation_config.max_new_tokens = 512
[VRAM] After for_inference: 56.85 GB alloc, 56.88 GB reserved, 121.47 GB free
[lora] Loading adapter from grpo-adapter-step-20
[VRAM] After LoRA adapter load: 56.94 GB alloc, 57.06 GB reserved, 121.29 GB free
[lora] Trainable: 23,887,872 / 59,044,394,304 (0.0405%)
[VRAM] After gradient checkpointing: 56.94 GB alloc, 57.06 GB reserved, 121.29 GB free
[VRAM] After GC: 56.94 GB alloc, 56.97 GB reserved, 121.38 GB free
[VRAM] After dataset load: 56.94 GB alloc, 56.97 GB reserved, 121.38 GB free - 17200 samples
[config] per_device_batch_size: 1
[config] gradient_accumulation: 4
[config] num_generations: 2
[config] Effective batch: 4
[config] Completions in memory: 2
Unsloth: We now expect per_device_train_batch_size to be a multiple of num_generations.
We will change the batch size of 1 to the num_generations of 2
[VRAM] After trainer init: 56.94 GB alloc, 56.97 GB reserved, 121.38 GB free
[generation_config] Final check before training:
- model.generation_config.max_length: 4608
- model.generation_config.max_new_tokens: 512
[VRAM] Before training (post-GC): 56.94 GB alloc, 56.97 GB reserved, 121.38 GB free
================================================================================
STARTING TRAINING
The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'bos_token_id': 199998}.
==((====))== Unsloth - 2x faster free finetuning | Num GPUs used = 1
\ /| Num examples = 17,200 | Num Epochs = 1 | Total steps = 4,300
O^O/ _/ \ Batch size per device = 2 | Gradient accumulation steps = 4
\ / Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
"-____-" Trainable parameters = 23,887,872 of 116,853,044,544 (0.02% trained)
0%| | 0/4300 [00:00<?, ?it/s]
ENTERING FIRST TRAINING STEP - DETAILED MEMORY TRACKING
[VRAM] Step 0: Before generation starts: 58.03 GB alloc, 58.07 GB reserved, 120.28 GB free
[generation_config] max_length: 4608
[generation_config] max_new_tokens: 512
[MEMORY CALC] Generation parameters:
- Batch size: 2
- Num generations: 2
- Total sequences: 4
- Max length per sequence: 4608
- Total tokens to generate: 18432
[MEMORY CALC] Estimated KV cache: 67.50 GB
- Formula: 2 × 80 layers × 96 heads × 128 dim × 4608 tokens × 4 seqs × 2 bytes
[MEMORY CALC] Estimated swiglu activations: 2.25 GB
- Formula: 4 seqs × 4608 tokens × 32768 intermediate × 2 × 2 bytes
[MEMORY CALC] Total estimated for generation: 69.75 GB
generation_config default values have been modified to match model-specific defaults: {'max_length': 4608}. If this is not desired, please set these values explicitly.
Traceback (most recent call last):
File "/root/uni_grpo/train.py", line 628, in <module>
main()
File "/root/uni_grpo/train.py", line 613, in main
train_result = trainer.train()
File "/root/uni_grpo/unsloth_compiled_cache/UnslothGRPOTrainer.py", line 53, in wrapper
output = f(self, *args, **kwargs)
File "/root/uni_grpo/.venv/lib/python3.10/site-packages/transformers/trainer.py", line 2328, in train
return inner_training_loop(
File "<string>", line 323, in _fast_inner_training_loop
File "<string>", line 34, in _unsloth_training_step
File "/root/uni_grpo/.venv/lib/python3.10/site-packages/trl/extras/profiling.py", line 98, in wrapper
return func(self, *args, **kwargs)
File "/root/uni_grpo/unsloth_compiled_cache/UnslothGRPOTrainer.py", line 2015, in _prepare_inputs
generation_batch = self._generate_and_score_completions(generation_batch)
File "/root/uni_grpo/unsloth_compiled_cache/UnslothGRPOTrainer.py", line 2323, in _generate_and_score_completions
prompt_completion_ids = unwrapped_model.generate(
File "/root/uni_grpo/.venv/lib/python3.10/site-packages/unsloth/models/rl.py", line 71, in generate_with_clone
out = original_generate(*args, **kwargs)
File "/root/uni_grpo/.venv/lib/python3.10/site-packages/peft/peft_model.py", line 1973, in generate
outputs = self.base_model.generate(*args, **kwargs)
File "/root/uni_grpo/.venv/lib/python3.10/site-packages/unsloth/models/vision.py", line 266, in unsloth_base_fast_generate
output = self._old_generate(*args, **kwargs)
File "/root/uni_grpo/.venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
return func(*args, **kwargs)
File "/root/uni_grpo/.venv/lib/python3.10/site-packages/transformers/generation/utils.py", line 2539, in generate
result = self._sample(
File "/root/uni_grpo/.venv/lib/python3.10/site-packages/transformers/generation/utils.py", line 2867, in _sample
outputs = self(**model_inputs, return_dict=True)
File "/root/uni_grpo/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/root/uni_grpo/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1784, in _call_impl
return forward_call(*args, **kwargs)
File "/root/uni_grpo/unsloth_compiled_cache/unsloth_compiled_module_gpt_oss.py", line 721, in forward
return GptOssForCausalLM_forward(self, input_ids, attention_mask, position_ids, past_key_values, inputs_embeds, labels, use_cache, output_router_logits, cache_position, logits_to_keep, **kwargs)
File "/root/uni_grpo/.venv/lib/python3.10/site-packages/torch/_dynamo/external_utils.py", line 198, in nonrecursive_disable_wrapper
return fn(*args, **kwargs)
File "/root/uni_grpo/.venv/lib/python3.10/site-packages/transformers/utils/generic.py", line 940, in wrapper
output = func(self, *args, **kwargs)
File "/root/uni_grpo/unsloth_compiled_cache/unsloth_compiled_module_gpt_oss.py", line 542, in GptOssForCausalLM_forward
outputs: MoeModelOutputWithPast = self.model(
File "/root/uni_grpo/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/root/uni_grpo/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1784, in _call_impl
return forward_call(*args, **kwargs)
File "/root/uni_grpo/.venv/lib/python3.10/site-packages/unsloth_zoo/temporary_patches/gpt_oss.py", line 1244, in forward
hidden_states = decoder_layer(
File "/root/uni_grpo/.venv/lib/python3.10/site-packages/transformers/modeling_layers.py", line 94, in __call__
return super().__call__(*args, **kwargs)
File "/root/uni_grpo/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/root/uni_grpo/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1784, in _call_impl
return forward_call(*args, **kwargs)
File "/root/uni_grpo/.venv/lib/python3.10/site-packages/transformers/utils/deprecation.py", line 172, in wrapped_func
return func(*args, **kwargs)
File "/root/uni_grpo/.venv/lib/python3.10/site-packages/transformers/models/gpt_oss/modeling_gpt_oss.py", line 381, in forward
hidden_states, _ = self.mlp(hidden_states) # diff with llama: router scores
File "/root/uni_grpo/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/root/uni_grpo/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1784, in _call_impl
return forward_call(*args, **kwargs)
File "/root/uni_grpo/.venv/lib/python3.10/site-packages/unsloth_zoo/temporary_patches/gpt_oss.py", line 643, in forward
routed_out = self.experts(hidden_states, router_indices=router_indices, routing_weights=router_scores)
File "/root/uni_grpo/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/root/uni_grpo/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1784, in _call_impl
return forward_call(*args, **kwargs)
File "/root/uni_grpo/.venv/lib/python3.10/site-packages/unsloth_zoo/temporary_patches/gpt_oss.py", line 503, in forward
fused = swiglu_torch_forward(gate_up, self.alpha, self.limit, dtype = X_rep.dtype)
File "/root/uni_grpo/.venv/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 736, in compile_wrapper
return fn(*args, **kwargs)
File "/root/uni_grpo/.venv/lib/python3.10/site-packages/unsloth_zoo/temporary_patches/gpt_oss.py", line 55, in swiglu_torch_forward
a_linear = a_linear.clamp(min=-limit, max=limit)
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 27.50 GiB. GPU 0 has a total capacity of 178.35 GiB of which 9.29 GiB is free. Including non-PyTorch memory, this process has 169.05 GiB memory in use. Of the allocated memory 168.16 GiB is allocated by PyTorch, and 69.66 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
0%| | 0/4300 [00:02<?, ?it/s]
(uni_grpo) root@gorgeous-chicken-of-unity:~/uni_grpo#
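For reference, the numbers printed under [MEMORY CALC] can be reproduced with a few lines of Python. This is a minimal sketch of the same arithmetic, not a measurement: the function names are hypothetical, and the layer count, head count, head dimension and intermediate size are simply the constants that appear in the logged formulas. They are assumptions of the debug code and may not match the actual gpt-oss-120b configuration. Note also that the values labelled "GB" in the log only come out as shown when dividing by 1024^3, so they are really GiB.

```python
GIB = 1024 ** 3  # the log labels these values "GB", but the arithmetic is in GiB


def kv_cache_bytes(num_layers, num_heads, head_dim, seq_len, num_seqs, bytes_per_elem=2):
    """Size of a dense KV cache: keys and values (the leading factor 2)
    for every layer, head, position and sequence."""
    return 2 * num_layers * num_heads * head_dim * seq_len * num_seqs * bytes_per_elem


def swiglu_activation_bytes(num_seqs, seq_len, intermediate, bytes_per_elem=2):
    """Gate and up projections (the factor 2) for every token position."""
    return num_seqs * seq_len * intermediate * 2 * bytes_per_elem


# Constants copied from the logged formulas (assumed, not read from the model).
log_kv = kv_cache_bytes(num_layers=80, num_heads=96, head_dim=128,
                        seq_len=4608, num_seqs=4)
log_swiglu = swiglu_activation_bytes(num_seqs=4, seq_len=4608, intermediate=32768)

print(f"KV cache: {log_kv / GIB:.2f} GiB")                  # 67.50
print(f"swiglu:   {log_swiglu / GIB:.2f} GiB")              # 2.25
print(f"total:    {(log_kv + log_swiglu) / GIB:.2f} GiB")   # 69.75
```

Because the estimate is only as good as its inputs, it is probably worth re-running it with values read from the loaded model (for example `model.config.num_hidden_layers`, `model.config.num_key_value_heads` and `model.config.head_dim`) before concluding that the KV cache is what fills the GPU.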