unslothai/unsloth

[Bug] OOM when doing inference on any model using unsloth from v2025-01

Open

#2590 opened on May 20, 2025

help wanted

Description

Hello guys,

Since updating to the first 2025 release, and with every release since then, I have had the same issue.

I'm using a Jetson AGX Orin platform with 60 GB of VRAM.

Initially, to make unsloth work on this device, I had to comment out the following lines in the init file:

if DEVICE_TYPE == "cuda":
    os.environ["PYTORCH_CUDA_ALLOC_CONF"] = \
         "expandable_segments:True,"\
         "roundup_power2_divisions:[32:256,64:128,256:64,>:32]"
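To confirm which allocator options are actually in effect after the lines above are commented out, the environment variable can be inspected at runtime. The following is a minimal sketch; `parse_alloc_conf` is a hypothetical helper written for this illustration (it is not part of unsloth or PyTorch), and it only assumes the `key:value,key:value` layout visible in the string above.

```python
import os
from typing import Dict


def parse_alloc_conf(value: str) -> Dict[str, str]:
    """Split a PYTORCH_CUDA_ALLOC_CONF-style string into {option: value}.

    Hypothetical helper for illustration. Commas inside [...] are kept,
    because roundup_power2_divisions uses a bracketed, comma-separated list.
    """
    parts, current, depth = [], [], 0
    for ch in value:
        if ch == "[":
            depth += 1
        elif ch == "]":
            depth -= 1
        if ch == "," and depth == 0:
            # Top-level comma: end of one option.
            parts.append("".join(current))
            current = []
        else:
            current.append(ch)
    if current:
        parts.append("".join(current))

    options = {}
    for part in parts:
        key, _, val = part.partition(":")
        if key.strip():
            options[key.strip()] = val.strip()
    return options


# Run this after `import unsloth` to see what the process ended up with.
active = parse_alloc_conf(os.environ.get("PYTORCH_CUDA_ALLOC_CONF", ""))
print(active if active else "PYTORCH_CUDA_ALLOC_CONF is not set")
```

An empty result after importing unsloth would indicate that the patched init file is the one being loaded.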

I'm using the Llama 3.3 70B model, which loads correctly with this code:

import torch
from unsloth import FastModel

vlm, processor = FastModel.from_pretrained(
    model_name = "unsloth/Llama-3.3-70B-Instruct-bnb-4bit",
    max_seq_length = any, # Choose any for long context!
    load_in_4bit = True,  # 4 bit quantization to reduce memory
    load_in_8bit = False, # [NEW!] A bit more accurate, uses 2x memory
    #fast_inference = True,
    full_finetuning = False, # [NEW!] We have full finetuning now!
    device_map = "cuda",
    # token = "hf_...", # use one if using gated models
    dtype = torch.bfloat16,
)

FastModel.for_inference(vlm)

With previous unsloth versions (<2025), inference used only the memory reserved when the model was loaded (screen 1). It works perfectly and needs no additional RAM when calling model generation.

Image

From v2025+ (including the latest May release), when I try to generate something from **any model**, there is a huge increase in memory allocation, as if the model were being loaded a second time, which causes my Jetson to crash with an OOM error (screen 2).

Image
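To put numbers on the jump seen in the screenshots, the allocator statistics can be read before and after a generation call. This is a rough sketch, not a diagnosis: `memory_delta` is a hypothetical helper, and the memory probes are passed in as callables so the same function can be exercised without a GPU. On the device, the probes would be the PyTorch functions `torch.cuda.memory_allocated`, `torch.cuda.max_memory_allocated` and `torch.cuda.reset_peak_memory_stats`.

```python
from typing import Callable, Dict

GIB = 1024 ** 3


def memory_delta(
    run: Callable[[], object],
    read_allocated: Callable[[], int],
    read_peak: Callable[[], int],
    reset_peak: Callable[[], None],
) -> Dict[str, float]:
    """Run `run()` and report allocated and peak memory around it, in GiB."""
    reset_peak()                 # start peak tracking from the current level
    before = read_allocated()
    run()
    peak = read_peak()
    after = read_allocated()
    return {
        "before_gib": before / GIB,
        "peak_gib": peak / GIB,
        "after_gib": after / GIB,
        "extra_peak_gib": (peak - before) / GIB,
    }


# Intended use on the device (assumes `vlm` from the loading code above and
# a hypothetical, already tokenized `inputs` dict):
#
#   import torch
#   stats = memory_delta(
#       lambda: vlm.generate(**inputs, max_new_tokens=32),
#       torch.cuda.memory_allocated,
#       torch.cuda.max_memory_allocated,
#       torch.cuda.reset_peak_memory_stats,
#   )
#   print(stats)
```

If `extra_peak_gib` is close to the size of the loaded weights on v2025+ and near zero on the 2024 release, that would support the "loaded a second time" impression.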

To be sure, I downgraded to the last 2024 version (unsloth and unsloth-zoo), and it works perfectly with the same code.

Do you have any idea what the root cause could be?

Here is a summary of the packages I'm using (I insist that everything works with the older version of unsloth, even with torch 2.6 and CUDA 12.8):

Image
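Since the package summary is only available as a screenshot, a text version that can be pasted into the issue may be easier to compare between the working and failing environments. A minimal sketch using only the standard library; the package list is an assumption about what is relevant here and can be adjusted.

```python
from importlib import metadata
from typing import Dict, Iterable, Optional


def installed_versions(packages: Iterable[str]) -> Dict[str, Optional[str]]:
    """Return {package: version}, with None for packages that are not installed."""
    versions: Dict[str, Optional[str]] = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Assumed list of relevant packages; edit as needed.
PACKAGES = ["unsloth", "unsloth_zoo", "torch", "transformers", "bitsandbytes"]

for name, version in installed_versions(PACKAGES).items():
    print(f"{name}: {version if version is not None else 'not installed'}")
```

Running it once in the 2024 environment and once in the 2025 environment gives two short lists that can be diffed line by line.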
