[Bug] Loss & gradient norm zero when full finetuning custom model & most layers frozen · unslothai/unsloth#3054

(1 comment) (0 reactions) (0 assignees)Python (5,658 forks)batch import

help wanted

Repository metrics

Stars: (64,271 stars)
PR merge metrics: (Avg merge 3d 15h) (525 merged PRs in 30d)

Description

Did you update? pip install --upgrade unsloth unsloth_zoo ✅
[local]
Number GPUs used (1), use nvidia-smi

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA RTX A6000               On  | 00000000:61:00.0 Off |                  Off |
| 30%   38C    P2             142W / 300W |   8590MiB / 49140MiB |     32%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A     50690      G   /usr/libexec/Xorg                             8MiB |
|    0   N/A  N/A   2205206      C   python                                     8566MiB |
+---------------------------------------------------------------------------------------+

Which notebook? Please link!
Which Unsloth version, TRL version, transformers version, PyTorch version?

==((====))==  Unsloth 2025.7.8: Fast Siglip patching. Transformers: 4.53.2. vLLM: 0.8.5.post1.
   \\   /|    NVIDIA RTX A6000. Num GPUs = 1. Max memory: 47.536 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 8.6. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.29.post2. FA2 = True]
 "-____-"     Free license: http://github.com/unslothai/unsloth

Which trainer? [SFTTrainer]

I'm training a custom architecture that resembles LoRA with a learnable scaling factor ($\alpha/r$).

Each linear layer in a normal model (I'm using Qwen2.5 3B Instruct) is replaced with something where the forward pass looks like:

def forward(self, x):
  scale = F.sigmoid(x @ self.gate)
  return x @ base_layer + scale * (lora_B(lora_A(x))

The only layers that have _requires_grad = True are the gates. The base linear layers, the loras, etc. are all frozen.

I'm using full_finetuning=True. Occurs with and without gradient checkpointing set.

Contributor guide

Research direction: Examine the forward pass of the custom gate layer. Check if the scale vector is causing gradient flow issues. Also verify that full finetuning=True correctly handles the custom module's parameters.
Tech stack: pythonpytorch
Domain: backendai
Issue type: Bug
Difficulty: 3
Estimated time: Half day
Activity status: Active
Clarity: Clear
Prerequisites: PyTorchGradient Flow
Newbie friendliness: 40

Repository metrics

Description

Contributor guide

Get fresh easy issues in your inbox.