unslothai/unsloth
[Bug] Loss & gradient norm zero when full finetuning custom model & most layers frozen
Open
#3054 opened on Jul 29, 2025
help wanted
Description
- Did you update? pip install --upgrade unsloth unsloth_zoo ✅
- [local]
- Number GPUs used (1), use nvidia-smi
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA RTX A6000               On  | 00000000:61:00.0 Off |                  Off |
| 30%   38C    P2             142W / 300W |   8590MiB / 49140MiB |     32%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A     50690      G   /usr/libexec/Xorg                             8MiB |
|    0   N/A  N/A   2205206      C   python                                     8566MiB |
+---------------------------------------------------------------------------------------+
- Which notebook? Please link!
- Which Unsloth version, TRL version, transformers version, PyTorch version?
==((====))== Unsloth 2025.7.8: Fast Siglip patching. Transformers: 4.53.2. vLLM: 0.8.5.post1.
\\ /| NVIDIA RTX A6000. Num GPUs = 1. Max memory: 47.536 GB. Platform: Linux.
O^O/ \_/ \ Torch: 2.6.0+cu124. CUDA: 8.6. CUDA Toolkit: 12.4. Triton: 3.2.0
\ / Bfloat16 = TRUE. FA [Xformers = 0.0.29.post2. FA2 = True]
"-____-" Free license: http://github.com/unslothai/unsloth
- Which trainer? [SFTTrainer]
I'm training a custom architecture that resembles LoRA with a learnable scaling factor ($\alpha/r$).
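Each linear layer in a normal model (I'm using Qwen2.5 3B Instruct) is replaced with a module whose forward pass looks like:
def forward(self, x):
    # Input-dependent scale in (0, 1); self.gate is the only trainable parameter
    scale = F.sigmoid(x @ self.gate)
    # Frozen base projection plus the scaled, frozen LoRA update
    return self.base_layer(x) + scale * self.lora_B(self.lora_A(x))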
The only parameters with requires_grad = True are the gates. The base linear layers, the LoRA matrices, etc. are all frozen.
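For anyone trying to reproduce this outside the trainer, the layer described above could be sketched in plain PyTorch roughly as follows. This is only a guess at the setup, not the actual code from my project: the class name `GatedLoRALinear`, the rank, the shape of `gate` (a single vector giving one scale per token), and the default `nn.Linear` initialisation are all assumptions. Note that `lora_B` is deliberately left non-zero here; with a zero-initialised `lora_B` the gate gradient would be exactly zero by construction.

```python
import torch
import torch.nn as nn


class GatedLoRALinear(nn.Module):
    """Hypothetical stand-in for the gated LoRA layer described in this issue."""

    def __init__(self, in_features, out_features, rank=4):
        super().__init__()
        self.base_layer = nn.Linear(in_features, out_features)
        self.lora_A = nn.Linear(in_features, rank, bias=False)
        self.lora_B = nn.Linear(rank, out_features, bias=False)
        # Assumption: one gate vector, producing one scalar scale per token.
        self.gate = nn.Parameter(torch.randn(in_features) * 0.02)
        # Freeze everything except the gate.
        for name, param in self.named_parameters():
            param.requires_grad = (name == "gate")

    def forward(self, x):
        # (..., in_features) @ (in_features,) -> (...,), then add a trailing dim
        scale = torch.sigmoid(x @ self.gate).unsqueeze(-1)
        return self.base_layer(x) + scale * self.lora_B(self.lora_A(x))


torch.manual_seed(0)
layer = GatedLoRALinear(16, 8)
x = torch.randn(2, 5, 16)          # (batch, tokens, features)
layer(x).pow(2).mean().backward()  # dummy loss, just to get gradients

trainable = [n for n, p in layer.named_parameters() if p.requires_grad]
gate_grad_norm = layer.gate.grad.norm().item()
print(trainable, gate_grad_norm)
```

In this isolated form the gate is the only trainable parameter and it does receive a non-zero gradient, which is what I would expect to see during training as well.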
I'm using full_finetuning=True. The problem occurs both with and without gradient checkpointing enabled.
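One way to narrow this down might be to run a single forward/backward pass by hand and look at the gradient norm of each trainable parameter, independent of what the trainer logs. The sketch below uses a toy model with a frozen first layer as a placeholder; `trainable_grad_report` is a hypothetical helper name, and nothing here touches Unsloth or TRL internals.

```python
import torch
import torch.nn as nn


def trainable_grad_report(model):
    """Map each trainable parameter name to its grad norm (None if no grad)."""
    report = {}
    for name, param in model.named_parameters():
        if param.requires_grad:
            grad = param.grad
            report[name] = None if grad is None else grad.norm().item()
    return report


torch.manual_seed(0)
# Hypothetical toy model: first layer frozen, last layer trainable.
model = nn.Sequential(nn.Linear(8, 8), nn.Tanh(), nn.Linear(8, 1))
for param in model[0].parameters():
    param.requires_grad = False

loss = model(torch.randn(4, 8)).pow(2).mean()  # dummy loss
loss.backward()

report = trainable_grad_report(model)
for name, norm in report.items():
    print(name, norm)
```

A value of `None` would mean the parameter never received a gradient at all, while `0.0` would mean a gradient was computed but is exactly zero, which are two different failure modes.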