unslothai/unsloth

[Bug] Loss & gradient norm zero when full finetuning custom model & most layers frozen

Open

#3,054 opened on Jul 29, 2025

help wanted

Description

  1. Did you update? pip install --upgrade unsloth unsloth_zoo
  2. [local]
  3. Number GPUs used (1), use nvidia-smi
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA RTX A6000               On  | 00000000:61:00.0 Off |                  Off |
| 30%   38C    P2             142W / 300W |   8590MiB / 49140MiB |     32%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A     50690      G   /usr/libexec/Xorg                             8MiB |
|    0   N/A  N/A   2205206      C   python                                     8566MiB |
+---------------------------------------------------------------------------------------+
  4. Which notebook? Please link!
  5. Which Unsloth version, TRL version, transformers version, PyTorch version?
==((====))==  Unsloth 2025.7.8: Fast Siglip patching. Transformers: 4.53.2. vLLM: 0.8.5.post1.
   \\   /|    NVIDIA RTX A6000. Num GPUs = 1. Max memory: 47.536 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 8.6. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.29.post2. FA2 = True]
 "-____-"     Free license: http://github.com/unslothai/unsloth
  6. Which trainer? [SFTTrainer]

I'm training a custom architecture that resembles LoRA with a learnable scaling factor ($\alpha/r$).

Each linear layer in a normal model (I'm using Qwen2.5 3B Instruct) is replaced with something where the forward pass looks like:

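def forward(self, x):
    # Input-dependent gate in (0, 1), used as the learnable scaling factor
    scale = torch.sigmoid(x @ self.gate)
    # Frozen base projection plus the gated low-rank (LoRA) update
    return self.base_layer(x) + scale * self.lora_B(self.lora_A(x))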

The only parameters with requires_grad = True are the gates. The base linear layers, the LoRA matrices, and everything else are frozen.

I'm using full_finetuning=True. Both the loss and the gradient norm are reported as zero, and this happens with and without gradient checkpointing enabled.
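
For reference, here is a minimal standalone sketch of the setup described above in plain PyTorch, without Unsloth or TRL. It is not the actual model code: the class name GatedLoRALinear, the gate shape (in_features, 1), the rank, the random initialization, and the toy squared-output loss are all assumptions made for illustration. It only shows what a single gated layer with everything but the gate frozen looks like, and how the gate's gradient norm can be inspected after one backward pass.

```python
import torch
import torch.nn as nn


class GatedLoRALinear(nn.Module):
    """Hypothetical stand-in for the custom layer described in this issue."""

    def __init__(self, in_features, out_features, rank=8):
        super().__init__()
        self.base_layer = nn.Linear(in_features, out_features, bias=False)
        self.lora_A = nn.Linear(in_features, rank, bias=False)
        self.lora_B = nn.Linear(rank, out_features, bias=False)
        # Assumed gate shape: one gate logit per token.
        self.gate = nn.Parameter(torch.zeros(in_features, 1))

    def forward(self, x):
        scale = torch.sigmoid(x @ self.gate)
        return self.base_layer(x) + scale * self.lora_B(self.lora_A(x))


def gate_grad_norm(seed=0):
    torch.manual_seed(seed)
    layer = GatedLoRALinear(16, 16)
    # Freeze everything except the gate, as in the report.
    for name, param in layer.named_parameters():
        param.requires_grad = name == "gate"
    x = torch.randn(4, 16)
    # Toy loss (an assumption); lora_B is randomly initialized here, so the
    # LoRA branch is nonzero and the gate receives a gradient.
    loss = layer(x).pow(2).mean()
    loss.backward()
    trainable = [n for n, p in layer.named_parameters() if p.requires_grad]
    return trainable, loss.item(), layer.gate.grad.norm().item()


trainable, loss, grad_norm = gate_grad_norm()
print(trainable, loss, grad_norm)
```

In this isolated sketch the gate is the only trainable parameter and both the loss and its gradient norm come out nonzero, which is the behavior expected from the full training run as well.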
