[Bug] OOM when doing inference on any model using unsloth from v2025-01 · unslothai/unsloth#2590

Repository metrics

Stars: 64,271
PR merge metrics: average merge time 3d 15h; 525 PRs merged in the last 30 days

Description

Hello guys,

Since I updated to the first 2025 version, and with every version released since then, I have had the same issue.

I'm using a Jetson AGX Orin platform with 60 GB of VRAM.

Initially, to make unsloth work on this device, I had to comment out the following lines in the init file:

if DEVICE_TYPE == "cuda":
    os.environ["PYTORCH_CUDA_ALLOC_CONF"] = \
         "expandable_segments:True,"\
         "roundup_power2_divisions:[32:256,64:128,256:64,>:32]"

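For readers unfamiliar with this setting: the two adjacent string literals above are concatenated into one `PYTORCH_CUDA_ALLOC_CONF` value carrying two allocator options. The following is only a minimal sketch to make that visible; `parse_alloc_conf` is a hypothetical helper written for this illustration, not part of unsloth or PyTorch.

```python
def parse_alloc_conf(conf):
    """Split a PYTORCH_CUDA_ALLOC_CONF-style string into {option: raw_value}.

    Hypothetical helper for illustration only. Commas inside [...] belong
    to a single option value, so the split tracks bracket depth.
    """
    options = {}
    depth = 0
    current = ""
    for ch in conf:
        if ch == "[":
            depth += 1
        elif ch == "]":
            depth -= 1
        if ch == "," and depth == 0:
            key, _, value = current.partition(":")
            options[key.strip()] = value.strip()
            current = ""
        else:
            current += ch
    if current:
        key, _, value = current.partition(":")
        options[key.strip()] = value.strip()
    return options


# The value built by the lines quoted from the init file.
conf = (
    "expandable_segments:True,"
    "roundup_power2_divisions:[32:256,64:128,256:64,>:32]"
)

for option, value in parse_alloc_conf(conf).items():
    print(option, "=", value)
```

Commenting out the block in the init file removes both options at once, so this report alone does not show which of the two the Jetson rejects.
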
I'm using the Llama 3.3 70B model, which loads correctly with this code:

import torch
from unsloth import FastModel

vlm, processor = FastModel.from_pretrained(
    model_name = "unsloth/Llama-3.3-70B-Instruct-bnb-4bit",
    max_seq_length = any, # Choose any for long context!
    load_in_4bit = True,  # 4 bit quantization to reduce memory
    load_in_8bit = False, # [NEW!] A bit more accurate, uses 2x memory
    #fast_inference = True,
    full_finetuning = False, # [NEW!] We have full finetuning now!
    device_map = "cuda", # token = "hf_...", # use one if using gated models
    dtype = torch.bfloat16
)

FastModel.for_inference(vlm)

With previous unsloth versions (<2025), inference used only the memory initially reserved when loading the model (screenshot 1). It worked perfectly and needed no additional RAM when calling model generation.

From v2025 onward (including the latest release from May), when I try to generate anything from **any model**, there is a huge increase in memory allocation, as if the model were being loaded a second time, which causes my Jetson to crash with an OOM error (screenshot 2).

To be sure, I downgraded to the last 2024 versions of unsloth and unsloth-zoo, and everything works perfectly with the same code.
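One way to put numbers on the difference between the two installs is to record PyTorch's CUDA memory counters right after loading and again after generation. This is only a sketch: `cuda_memory_snapshot` and `memory_growth_gib` are hypothetical helper names, and the counters cover only memory managed by PyTorch's caching allocator, which may not match what system tools report on a unified-memory device like the Jetson.

```python
def cuda_memory_snapshot():
    """Return PyTorch's allocated/reserved CUDA bytes.

    Returns None when torch is not installed or no CUDA device is visible.
    """
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return {
        "allocated": torch.cuda.memory_allocated(),
        "reserved": torch.cuda.memory_reserved(),
    }


def memory_growth_gib(before, after):
    """Per-counter growth between two snapshots, converted to GiB."""
    return {key: (after[key] - before[key]) / 2**30 for key in before}


# Intended use (assumption: the model is loaded as in the report):
#   after_load = cuda_memory_snapshot()      # right after from_pretrained
#   ... run one generation call ...
#   after_generate = cuda_memory_snapshot()
#   print(memory_growth_gib(after_load, after_generate))
```

Running the same script under the 2024 and 2025 installs and comparing the two growth figures would show whether the extra allocation is on the order of the model's size.
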

Do you have any idea what the root cause could be?

Here is a summary of the packages I'm using (I want to stress that everything works on older versions of unsloth, even with torch 2.6 and CUDA 12.8):
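A package summary like this can be captured as text so the working and failing environments are easy to diff. The sketch below uses only the standard library; `installed_versions` is a hypothetical helper, and the package list is limited to the names mentioned in this report.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Packages named in this report; extend the list as needed.
for name, version in installed_versions(["unsloth", "unsloth-zoo", "torch"]).items():
    print(f"{name}: {version if version is not None else 'not installed'}")
```
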

Contributor guide

Research direction: Investigate memory allocation differences between unsloth v2024 and v2025, focusing on the 'expandable segments' PyTorch allocator configuration and device map settings during inference.
Tech stack: Python, PyTorch
Domain: backend, machine learning, performance
Issue type: Bug
Difficulty: 3
Estimated time: 1-2 days
Activity status: Active
Clarity: Clear
Prerequisites: Python, PyTorch, CUDA
Newbie friendliness: 40
