unslothai/unsloth

[Bug] Loss & gradient norm zero when full finetuning custom model & most layers frozen

Open

#3.054 aberto em 29 de jul. de 2025

Ver no GitHub
 (1 comment) (0 reactions) (0 assignees)Python (5.658 forks)batch import
help wanted

Métricas do repositório

Stars
 (64.271 stars)
Métricas de merge de PR
 (Mesclagem média 3d 15h) (525 fundiu PRs em 30d)

Description

  1. Did you update? pip install --upgrade unsloth unsloth_zoo
  2. [local]
  3. Number GPUs used (1), use nvidia-smi
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA RTX A6000               On  | 00000000:61:00.0 Off |                  Off |
| 30%   38C    P2             142W / 300W |   8590MiB / 49140MiB |     32%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A     50690      G   /usr/libexec/Xorg                             8MiB |
|    0   N/A  N/A   2205206      C   python                                     8566MiB |
+---------------------------------------------------------------------------------------+
  1. Which notebook? Please link!
  2. Which Unsloth version, TRL version, transformers version, PyTorch version?
==((====))==  Unsloth 2025.7.8: Fast Siglip patching. Transformers: 4.53.2. vLLM: 0.8.5.post1.
   \\   /|    NVIDIA RTX A6000. Num GPUs = 1. Max memory: 47.536 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 8.6. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.29.post2. FA2 = True]
 "-____-"     Free license: http://github.com/unslothai/unsloth
  1. Which trainer? [SFTTrainer]

I'm training a custom architecture that resembles LoRA with a learnable scaling factor ($\alpha/r$).

Each linear layer in a normal model (I'm using Qwen2.5 3B Instruct) is replaced with something where the forward pass looks like:

def forward(self, x):
  scale = F.sigmoid(x @ self.gate)
  return x @ base_layer + scale * (lora_B(lora_A(x))

The only layers that have _requires_grad = True are the gates. The base linear layers, the loras, etc. are all frozen.

I'm using full_finetuning=True. Occurs with and without gradient checkpointing set.

Guia do colaborador