[Bug] Loss & gradient norm zero when full finetuning custom model & most layers frozen · unslothai/unsloth#3054

(1 comment) (0 reactions) (0 assignees) Python (5658 forks)

help wanted

Repository metrics

Stars: 64,271
PR merge metrics: average merge time 3d 15h; 525 PRs merged in 30 days

Description

Did you update? pip install --upgrade unsloth unsloth_zoo ✅
[local]
Number GPUs used (1), use nvidia-smi

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA RTX A6000               On  | 00000000:61:00.0 Off |                  Off |
| 30%   38C    P2             142W / 300W |   8590MiB / 49140MiB |     32%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A     50690      G   /usr/libexec/Xorg                             8MiB |
|    0   N/A  N/A   2205206      C   python                                     8566MiB |
+---------------------------------------------------------------------------------------+

Which notebook? Please link!
Which Unsloth version, TRL version, transformers version, PyTorch version?

==((====))==  Unsloth 2025.7.8: Fast Siglip patching. Transformers: 4.53.2. vLLM: 0.8.5.post1.
   \\   /|    NVIDIA RTX A6000. Num GPUs = 1. Max memory: 47.536 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 8.6. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.29.post2. FA2 = True]
 "-____-"     Free license: http://github.com/unslothai/unsloth

Which trainer? [SFTTrainer]

I'm training a custom architecture that resembles LoRA with a learnable scaling factor ($\alpha/r$).

Each linear layer in a normal model (I'm using Qwen2.5 3B Instruct) is replaced with something where the forward pass looks like:

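def forward(self, x):
    # F is torch.nn.functional; the gate is the only trainable parameter
    scale = F.sigmoid(x @ self.gate)
    # Frozen base projection plus the gated output of the frozen LoRA pair
    return self.base_layer(x) + scale * self.lora_B(self.lora_A(x))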

The only parameters with requires_grad = True are the gates. The base linear layers, the LoRA matrices, etc. are all frozen.
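To separate a problem in the layer itself from a problem in the training setup, the layer described above can be checked in plain PyTorch, outside Unsloth and the trainer. The following is only a sketch under stated assumptions: the class name `GatedLoRALinear`, the rank, the gate shape `(in_features, 1)`, and the random initialization are all hypothetical and not taken from the actual code. Note that if `lora_B` were all zeros (the usual initialization for a fresh LoRA), the gate gradient would be exactly zero no matter what, because the gate only multiplies the LoRA output.

```python
import torch
import torch.nn as nn


class GatedLoRALinear(nn.Module):
    """Hypothetical stand-in for the gated layer described in the issue."""

    def __init__(self, in_features, out_features, rank=4):
        super().__init__()
        self.base_layer = nn.Linear(in_features, out_features, bias=False)
        self.lora_A = nn.Linear(in_features, rank, bias=False)
        self.lora_B = nn.Linear(rank, out_features, bias=False)
        # Assumed gate shape: one scalar gate per token.
        self.gate = nn.Parameter(torch.zeros(in_features, 1))
        # Freeze everything except the gate, as in the issue.
        for module in (self.base_layer, self.lora_A, self.lora_B):
            for p in module.parameters():
                p.requires_grad_(False)

    def forward(self, x):
        scale = torch.sigmoid(x @ self.gate)
        return self.base_layer(x) + scale * self.lora_B(self.lora_A(x))


torch.manual_seed(0)
layer = GatedLoRALinear(16, 8)
x = torch.randn(2, 5, 16)          # (batch, tokens, features)
loss = layer(x).pow(2).mean()      # arbitrary scalar loss for the check
loss.backward()

trainable = [n for n, p in layer.named_parameters() if p.requires_grad]
gate_grad_norm = layer.gate.grad.norm().item()
print(trainable, loss.item(), gate_grad_norm)
```

If the gate receives a non-zero gradient here but not inside the full training run, the layer definition is probably not the culprit, and the model-loading or trainer path becomes the more likely place to look.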

I'm using full_finetuning=True. The problem occurs with and without gradient checkpointing enabled.

Contributor guide

Research direction: Examine the forward pass of the custom gate layer. Check whether the scale vector causes gradient-flow problems. Also verify that full_finetuning=True handles the custom module's parameters correctly.
Tech stack: Python, PyTorch
Domain: backend, AI
Issue type: Bug
Difficulty: 3
Estimated time: Half a day
Activity status: Active
Clarity: Clear
Prerequisites: PyTorch, Gradient Flow
Beginner-friendly: 40
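One way to follow the research direction above is to inspect, after a single backward pass, which parameters are still marked trainable and what gradient each one received. The helper below is a minimal sketch; `report_gradient_flow` is a hypothetical name, not part of Unsloth, TRL, or transformers, and the toy model only stands in for the real one.

```python
import torch
import torch.nn as nn


def report_gradient_flow(model, loss):
    """Backpropagate `loss` and map each trainable parameter name to its
    gradient norm (None if the parameter received no gradient at all)."""
    model.zero_grad(set_to_none=True)
    loss.backward()
    report = {}
    for name, param in model.named_parameters():
        if param.requires_grad:
            grad = param.grad
            report[name] = None if grad is None else grad.norm().item()
    return report


# Toy usage: freeze the first layer, leave the last one trainable.
torch.manual_seed(0)
toy = nn.Sequential(nn.Linear(4, 4), nn.Tanh(), nn.Linear(4, 1))
for p in toy[0].parameters():
    p.requires_grad_(False)

out = toy(torch.randn(3, 4))
report = report_gradient_flow(toy, out.pow(2).mean())
print(report)
```

Read against the real model, an empty report would suggest the gates were re-frozen somewhere during loading, a `None` entry that the parameter is not part of the computation graph, and a `0.0` entry that the parameter is in the graph but its gradient vanishes.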
