[Bug] Loss & gradient norm zero when full finetuning custom model & most layers frozen · unslothai/unsloth#3054

1 comment · 0 reactions · 0 assignees · Python · 5,658 forks

help wanted

Repository metrics

Stars: 64,271
PR merge metrics: average merge time 3d 15h; 525 PRs merged in 30 days

Description

Did you update? pip install --upgrade unsloth unsloth_zoo ✅
[local]
Number GPUs used (1), use nvidia-smi

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA RTX A6000               On  | 00000000:61:00.0 Off |                  Off |
| 30%   38C    P2             142W / 300W |   8590MiB / 49140MiB |     32%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A     50690      G   /usr/libexec/Xorg                             8MiB |
|    0   N/A  N/A   2205206      C   python                                     8566MiB |
+---------------------------------------------------------------------------------------+

Which notebook? Please link!
Which Unsloth version, TRL version, transformers version, PyTorch version?

==((====))==  Unsloth 2025.7.8: Fast Siglip patching. Transformers: 4.53.2. vLLM: 0.8.5.post1.
   \\   /|    NVIDIA RTX A6000. Num GPUs = 1. Max memory: 47.536 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 8.6. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.29.post2. FA2 = True]
 "-____-"     Free license: http://github.com/unslothai/unsloth

Which trainer? [SFTTrainer]

I'm training a custom architecture that resembles LoRA, except that the scaling factor (normally the constant $\alpha/r$) is learnable.

Each linear layer of the base model (Qwen2.5 3B Instruct in my case) is replaced with a module whose forward pass looks like:

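def forward(self, x):
    # Input-dependent scale in (0, 1); self.gate is the only trainable parameter
    scale = torch.sigmoid(x @ self.gate)
    # Frozen base projection plus the gated, frozen LoRA update
    return self.base_layer(x) + scale * self.lora_B(self.lora_A(x))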

The only parameters with requires_grad = True are the gates. The base linear layers, the LoRA matrices, etc. are all frozen.

I'm using full_finetuning=True. The problem occurs both with and without gradient checkpointing enabled.
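For anyone trying to reproduce this outside Unsloth, the setup described above can be sketched in plain PyTorch as follows. This is only a minimal stand-in, not the reporter's actual code: the class name `GatedLoRALinear`, the gate shape (one scalar scale per token), the rank, and the random adapter initialisation are all assumptions. It freezes everything except the gate and checks that the gate receives a non-zero gradient after one backward pass.

```python
import torch
import torch.nn as nn


class GatedLoRALinear(nn.Module):
    """Hypothetical stand-in for the layer in the report (shapes are assumptions)."""

    def __init__(self, in_features, out_features, rank=4):
        super().__init__()
        self.base_layer = nn.Linear(in_features, out_features, bias=False)
        self.lora_A = nn.Linear(in_features, rank, bias=False)
        self.lora_B = nn.Linear(rank, out_features, bias=False)
        # Assumed gate shape: projects each token to a single scalar scale.
        self.gate = nn.Parameter(torch.zeros(in_features, 1))

    def forward(self, x):
        scale = torch.sigmoid(x @ self.gate)
        return self.base_layer(x) + scale * self.lora_B(self.lora_A(x))


def gate_grad_norm(layer, x):
    """Run one forward/backward pass and return the norm of the gate gradient."""
    layer.zero_grad()
    layer(x).pow(2).mean().backward()
    return layer.gate.grad.norm().item()


torch.manual_seed(0)
layer = GatedLoRALinear(16, 8)

# Freeze everything except the gate, as in the report.
for name, param in layer.named_parameters():
    param.requires_grad = name == "gate"

trainable = [n for n, p in layer.named_parameters() if p.requires_grad]
x = torch.randn(2, 5, 16)  # (batch, tokens, features)
grad_norm = gate_grad_norm(layer, x)
print(trainable, grad_norm)
```

If a standalone layer like this trains normally but the same layer inside the patched model does not, that would point at the surrounding training setup rather than at the module's forward pass.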

Contributor guide

Research direction: Examine the forward pass of the custom gate layer. Check whether the scale vector causes gradient-flow problems. Also make sure that full_finetuning=True handles the custom module's parameters correctly.
Tech stack: Python, PyTorch
Domain: backend, AI
Issue type: Bug
Difficulty: 3
Estimated time: Half a day
Activity status: Active
Clarity: Clear
Prerequisites: PyTorch, gradient flow
Beginner accessibility: 40
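One gradient-flow check suggested by the research direction above: with this forward pass, the gate's gradient is multiplied by the output of the LoRA branch, so it is exactly zero whenever that branch outputs zeros. The report does not say how the frozen adapters were initialised. If `lora_B` still holds the zero initialisation that is conventional for LoRA (an assumption here), the gates cannot receive any gradient. The sketch below uses plain tensors with hypothetical shapes to show the effect. Note that this could account for a zero gradient norm but not, on its own, for a zero loss.

```python
import torch

torch.manual_seed(0)
x = torch.randn(4, 16)   # toy inputs (hypothetical shapes)
W = torch.randn(16, 8)   # frozen base weight
A = torch.randn(16, 4)   # frozen LoRA A
gate = torch.zeros(16, 1, requires_grad=True)  # the only trainable tensor


def gate_grad_norm(B):
    """Gradient norm of the gate for a given frozen LoRA B matrix."""
    gate.grad = None
    out = x @ W + torch.sigmoid(x @ gate) * (x @ A @ B)
    out.pow(2).mean().backward()
    return gate.grad.norm().item()


zero_B = gate_grad_norm(torch.zeros(4, 8))  # B left at zero initialisation
rand_B = gate_grad_norm(torch.randn(4, 8))  # B holding non-zero weights
print(zero_B, rand_B)
```

Printing the norm of each frozen `lora_B` weight in the real model would be a quick way to rule this case in or out.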
