nan in my own training scripts. #162

Open
Rody1911641 opened this issue Jan 26, 2025 · 2 comments

Comments

@Rody1911641

Rody1911641 commented Jan 26, 2025

I tried to write a simple script for training the SLat VAE, but after the first network parameter update, all the parameters became 'nan'.

import torch
from torch import autograd
from trellis import models  # assuming the TRELLIS repo is on PYTHONPATH

# Inside the model wrapper: load the pretrained fp16 SLat encoder and Gaussian decoder.
self.encoder = models.from_pretrained('JeffreyXiang/TRELLIS-image-large/ckpts/slat_enc_swin8_B_64l8_fp16')
self.decoder_gs = models.from_pretrained('JeffreyXiang/TRELLIS-image-large/ckpts/slat_dec_gs_swin8_B_64l8gs32_fp16')

# Training loop.
lr = 1.0e-4
epochs = 100
optimizer = torch.optim.AdamW([
    {'params': model.encoder.parameters(), 'lr': lr},
    {'params': model.decoder_gs.parameters(), 'lr': lr},
])
for epoch in range(epochs):
    for i, out in enumerate(train_loader):
        with autograd.detect_anomaly():
            feats = out['feats'].cuda()
            coords = out['coords'].cuda()
            renders = out['renders'].cuda()
            extr = out['extr'].cuda()
            intr = out['intr'].cuda()

            res = model(feats, coords, renders, extr, intr)
            loss = res['loss']

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

This is the log of the per-parameter gradients in the first iteration, right after loss.backward():

blocks.11.attn.to_qkv.weight grad mean: tensor(-0., device='cuda:0', dtype=torch.float16) dtype: torch.float16
blocks.11.attn.to_qkv.bias grad mean: tensor(6.1989e-06, device='cuda:0', dtype=torch.float16) dtype: torch.float16
blocks.11.attn.to_out.weight grad mean: tensor(0., device='cuda:0', dtype=torch.float16) dtype: torch.float16
blocks.11.attn.to_out.bias grad mean: tensor(1.7881e-07, device='cuda:0', dtype=torch.float16) dtype: torch.float16
blocks.11.mlp.mlp.0.weight grad mean: tensor(-0., device='cuda:0', dtype=torch.float16) dtype: torch.float16
blocks.11.mlp.mlp.0.bias grad mean: tensor(1.3471e-05, device='cuda:0', dtype=torch.float16) dtype: torch.float16
blocks.11.mlp.mlp.2.weight grad mean: tensor(0., device='cuda:0', dtype=torch.float16) dtype: torch.float16
blocks.11.mlp.mlp.2.bias grad mean: tensor(1.7881e-07, device='cuda:0', dtype=torch.float16) dtype: torch.float16
out_layer.weight grad mean: tensor(-1.2083e-11, device='cuda:0') dtype: torch.float32
out_layer.bias grad mean: tensor(0.0001, device='cuda:0') dtype: torch.float32

This is the log of the parameters in the first iteration, right after optimizer.step():

Parameter: blocks.11.attn.to_qkv.weight, weights: nan
Parameter: blocks.11.attn.to_qkv.bias, weights: nan
Parameter: blocks.11.attn.to_out.weight, weights: nan
Parameter: blocks.11.attn.to_out.bias, weights: nan
Parameter: blocks.11.mlp.mlp.0.weight, weights: nan
Parameter: blocks.11.mlp.mlp.0.bias, weights: nan
Parameter: blocks.11.mlp.mlp.2.weight, weights: nan
Parameter: blocks.11.mlp.mlp.2.bias, weights: nan
Parameter: out_layer.weight, weights: -0.0008287686505354941
Parameter: out_layer.bias, weights: -0.02195589430630207
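
(For reference, the statistics above were printed with a loop roughly like the following; this is only a sketch, the exact logging code isn't shown in this issue.)

# Sketch: print gradient / weight statistics for every parameter.
for name, p in model.named_parameters():
    if p.grad is not None:
        print(f'{name} grad mean: {p.grad.mean()} dtype: {p.grad.dtype}')

# After optimizer.step():
for name, p in model.named_parameters():
    print(f'Parameter: {name}, weights: {p.data.mean().item()}')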
@ivand-all3d

I encountered a similar issue; check out this post on the PyTorch forums: https://discuss.pytorch.org/t/adam-half-precision-nans/1765.
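
The gist of that thread: the checkpoints you load are stored in fp16, and AdamW's default eps=1e-8 (as well as the tiny second-moment estimates) underflows in half precision, so the very first step can produce NaNs. A rough sketch of the usual workarounds (assuming the same `model`, `lr`, and `train_loader` as in your snippet):

# Option 1: keep master weights in fp32 and run the forward pass under autocast.
model.float()  # upcast the fp16 pretrained weights to fp32 for the optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
scaler = torch.cuda.amp.GradScaler()

for i, out in enumerate(train_loader):
    feats, coords = out['feats'].cuda(), out['coords'].cuda()
    renders, extr, intr = out['renders'].cuda(), out['extr'].cuda(), out['intr'].cuda()

    with torch.cuda.amp.autocast(dtype=torch.float16):
        loss = model(feats, coords, renders, extr, intr)['loss']

    optimizer.zero_grad()
    scaler.scale(loss).backward()  # scale the loss so fp16 grads do not underflow
    scaler.step(optimizer)         # unscales grads, skips the step if they are inf/nan
    scaler.update()

# Option 2: if the weights must stay fp16, raise eps so it is representable in
# half precision, e.g.:
# optimizer = torch.optim.AdamW(model.parameters(), lr=lr, eps=1e-4)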

@Rody1911641
Author

@ivand-all3d Thank you very much. Your solution is very helpful.
