[Bug] OOM when doing inference on any model using unsloth from v2025-01
#2,590 opened on May 20, 2025
Description
Hello guys,
Since updating to the first 2025 release, and with every release since, I have had the same issue.
I'm using a Jetson AGX Orin platform with 60 GB of VRAM.
Initially, to make unsloth work on this device, I had to comment out the following lines in the init file:
if DEVICE_TYPE == "cuda":
    os.environ["PYTORCH_CUDA_ALLOC_CONF"] = \
        "expandable_segments:True,"\
        "roundup_power2_divisions:[32:256,64:128,256:64,>:32]"
I'm using the Llama 3.3 70B model, which loads correctly with this code:
import torch
from unsloth import FastModel

vlm, processor = FastModel.from_pretrained(
    model_name = "unsloth/Llama-3.3-70B-Instruct-bnb-4bit",
    max_seq_length = any, # Choose any for long context!
    load_in_4bit = True, # 4 bit quantization to reduce memory
    load_in_8bit = False, # [NEW!] A bit more accurate, uses 2x memory
    #fast_inference = True,
    full_finetuning = False, # [NEW!] We have full finetuning now!
    device_map = "cuda",
    # token = "hf_...", # use one if using gated models
    dtype = torch.bfloat16,
)
FastModel.for_inference(vlm)
With previous unsloth versions (<2025), inference used only the RAM reserved when the model was loaded (screenshot 1). It worked perfectly and needed no additional RAM when calling model generation.
From v2025 onward (including the latest May release), when I try to generate anything with **any model**, memory allocation increases sharply, as if the model were being loaded a second time, which causes my Jetson to crash with an OOM error (screenshot 2).
To be sure, I downgraded to the last 2024 release (unsloth and unsloth-zoo), and it works perfectly with the same code.
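For anyone who wants numbers rather than screenshots, here is a minimal sketch of how the jump could be measured on both versions. It only uses PyTorch's standard CUDA memory counters; the helper names `cuda_memory_snapshot` and `memory_delta` are my own (hypothetical), and the counters only cover memory managed by PyTorch's allocator, so they may not match what the system monitor shows on a unified-memory device like the Jetson.

```python
import importlib.util

GIB = 2**30  # bytes per GiB


def cuda_memory_snapshot():
    """Return PyTorch CUDA memory counters in GiB, or None if CUDA is unavailable.

    Hypothetical helper: only covers memory handled by PyTorch's caching allocator.
    """
    if importlib.util.find_spec("torch") is None:
        return None
    import torch

    if not torch.cuda.is_available():
        return None
    return {
        "allocated_gib": torch.cuda.memory_allocated() / GIB,
        "reserved_gib": torch.cuda.memory_reserved() / GIB,
        "peak_allocated_gib": torch.cuda.max_memory_allocated() / GIB,
    }


def memory_delta(before, after):
    """Return the per-counter difference between two snapshots (after - before)."""
    if before is None or after is None:
        return None
    return {key: after[key] - before[key] for key in before}


# Intended use around a generation call (`inputs` is an assumed, already
# tokenized batch on the GPU; it is not part of the code in this report):
#
#   before = cuda_memory_snapshot()
#   outputs = vlm.generate(**inputs, max_new_tokens=32)
#   print(memory_delta(before, cuda_memory_snapshot()))
```

Running the same snippet on the last 2024 release and on a 2025 release should show whether the extra memory is allocated by PyTorch during generation or comes from somewhere else.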
Do you have any idea what the root cause could be?
Here is a summary of the packages I'm using (I stress that everything works on older versions of unsloth, even with torch 2.6 and CUDA 12.8):