unslothai/unsloth

[Bug] Loss & gradient norm zero when full finetuning custom model & most layers frozen

Open

#3054 opened on Jul 29, 2025

help wanted


Description

  1. Did you update? pip install --upgrade unsloth unsloth_zoo
  2. [local]
  3. Number GPUs used (1), use nvidia-smi
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA RTX A6000               On  | 00000000:61:00.0 Off |                  Off |
| 30%   38C    P2             142W / 300W |   8590MiB / 49140MiB |     32%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A     50690      G   /usr/libexec/Xorg                             8MiB |
|    0   N/A  N/A   2205206      C   python                                     8566MiB |
+---------------------------------------------------------------------------------------+
  4. Which notebook? Please link!
  5. Which Unsloth version, TRL version, transformers version, PyTorch version?
==((====))==  Unsloth 2025.7.8: Fast Siglip patching. Transformers: 4.53.2. vLLM: 0.8.5.post1.
   \\   /|    NVIDIA RTX A6000. Num GPUs = 1. Max memory: 47.536 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 8.6. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.29.post2. FA2 = True]
 "-____-"     Free license: http://github.com/unslothai/unsloth
  6. Which trainer? [SFTTrainer]

I'm training a custom architecture that resembles LoRA with a learnable scaling factor ($\alpha/r$).

Each linear layer in the base model (I'm using Qwen2.5 3B Instruct) is replaced with a module whose forward pass looks like:

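def forward(self, x):
  # Input-dependent gate in (0, 1); the gates are the only trainable parameters
  scale = F.sigmoid(x @ self.gate)
  # Frozen base projection plus the gated, frozen LoRA update
  return x @ base_layer + scale * lora_B(lora_A(x))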

The only parameters with requires_grad = True are the gates. The base linear layers, the LoRA matrices, etc. are all frozen.
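For reference, the setup described above can be sketched as a standalone module in plain PyTorch, without Unsloth or the trainer. This is only an illustration under assumptions: the class name `GatedLoRALinear`, the gate shape `(in_features, 1)`, the rank, and the toy loss are all hypothetical and not taken from the actual model.

```python
import torch
import torch.nn as nn


class GatedLoRALinear(nn.Module):
    """Hypothetical stand-in for the gated LoRA layer described above."""

    def __init__(self, in_features, out_features, rank=8):
        super().__init__()
        self.base_layer = nn.Linear(in_features, out_features, bias=False)
        self.lora_A = nn.Linear(in_features, rank, bias=False)
        # nn.Linear's default init is nonzero. With a zero-initialized B
        # (the usual LoRA init), the gate gradient would be exactly zero.
        self.lora_B = nn.Linear(rank, out_features, bias=False)
        # Assumed gate shape: one scalar logit per input vector.
        self.gate = nn.Parameter(torch.zeros(in_features, 1))
        # Freeze everything except the gate.
        for name, param in self.named_parameters():
            param.requires_grad = (name == "gate")

    def forward(self, x):
        scale = torch.sigmoid(x @ self.gate)  # shape (..., 1), broadcasts
        return self.base_layer(x) + scale * self.lora_B(self.lora_A(x))


torch.manual_seed(0)
layer = GatedLoRALinear(16, 32)
x = torch.randn(4, 16)
loss = layer(x).pow(2).mean()  # toy loss, only to drive a backward pass
loss.backward()

trainable = [n for n, p in layer.named_parameters() if p.requires_grad]
gate_grad_norm = layer.gate.grad.norm().item()
print(trainable, loss.item(), gate_grad_norm)
```

In this isolated sketch the gate is the only trainable parameter and it receives a nonzero gradient, which is the behaviour expected from the full training run as well.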

I'm using full_finetuning=True. The zero loss and zero gradient norm occur both with and without gradient checkpointing enabled.
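One way to narrow down where the zeros come from is to check, right after a backward pass, which parameters are trainable and whether they actually received gradients. The following is a minimal sketch in plain PyTorch: `grad_report` is a hypothetical helper, and the toy model is a stand-in for illustration, not the Qwen2.5 setup from this report.

```python
import torch
import torch.nn as nn


def grad_report(model):
    """Return (trainable names, trainable names with no grad, global grad norm)."""
    trainable, missing, sq_sum = [], [], 0.0
    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue
        trainable.append(name)
        if param.grad is None:
            missing.append(name)
        else:
            sq_sum += param.grad.detach().float().pow(2).sum().item()
    return trainable, missing, sq_sum ** 0.5


# Toy model: first layer frozen, last layer trainable.
torch.manual_seed(0)
model = nn.Sequential(nn.Linear(8, 8), nn.Tanh(), nn.Linear(8, 1))
for p in model[0].parameters():
    p.requires_grad = False

model(torch.randn(4, 8)).pow(2).mean().backward()
trainable, missing, total_norm = grad_report(model)
print(trainable, missing, total_norm)
```

Calling such a helper on the real model after a manual `loss.backward()` would show whether the gates are still marked trainable once the model has been loaded and patched, and whether their gradients are missing or merely zero.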
