nan in my own training scripts. #162

Open
Rody1911641 opened this issue Jan 26, 2025 · 2 comments

Comments

@Rody1911641

Rody1911641 commented Jan 26, 2025

I tried to write a simple script for training the SLat VAE, but after the first network parameter update, all the parameters became 'nan'.

import torch
from torch import autograd
from trellis import models  # assuming the TRELLIS repo is on PYTHONPATH

# Inside the model wrapper: load the pretrained fp16 SLat encoder and Gaussian decoder.
self.encoder = models.from_pretrained('JeffreyXiang/TRELLIS-image-large/ckpts/slat_enc_swin8_B_64l8_fp16')
self.decoder_gs = models.from_pretrained('JeffreyXiang/TRELLIS-image-large/ckpts/slat_dec_gs_swin8_B_64l8gs32_fp16')

# Training loop.
lr = 1.0e-4
epochs = 100
optimizer = torch.optim.AdamW([
    {'params': model.encoder.parameters(), 'lr': lr},
    {'params': model.decoder_gs.parameters(), 'lr': lr},
])
for epoch in range(epochs):
    for i, out in enumerate(train_loader):
        with autograd.detect_anomaly():
            feats = out['feats'].cuda()
            coords = out['coords'].cuda()
            renders = out['renders'].cuda()
            extr = out['extr'].cuda()
            intr = out['intr'].cuda()

            res = model(feats, coords, renders, extr, intr)
            loss = res['loss']

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

This is the log of the per-parameter gradients in the first iteration, right after loss.backward():

blocks.11.attn.to_qkv.weight grad mean: tensor(-0., device='cuda:0', dtype=torch.float16) dtype: torch.float16
blocks.11.attn.to_qkv.bias grad mean: tensor(6.1989e-06, device='cuda:0', dtype=torch.float16) dtype: torch.float16
blocks.11.attn.to_out.weight grad mean: tensor(0., device='cuda:0', dtype=torch.float16) dtype: torch.float16
blocks.11.attn.to_out.bias grad mean: tensor(1.7881e-07, device='cuda:0', dtype=torch.float16) dtype: torch.float16
blocks.11.mlp.mlp.0.weight grad mean: tensor(-0., device='cuda:0', dtype=torch.float16) dtype: torch.float16
blocks.11.mlp.mlp.0.bias grad mean: tensor(1.3471e-05, device='cuda:0', dtype=torch.float16) dtype: torch.float16
blocks.11.mlp.mlp.2.weight grad mean: tensor(0., device='cuda:0', dtype=torch.float16) dtype: torch.float16
blocks.11.mlp.mlp.2.bias grad mean: tensor(1.7881e-07, device='cuda:0', dtype=torch.float16) dtype: torch.float16
out_layer.weight grad mean: tensor(-1.2083e-11, device='cuda:0') dtype: torch.float32
out_layer.bias grad mean: tensor(0.0001, device='cuda:0') dtype: torch.float32

This is the log of the parameters in the first iteration, right after optimizer.step():

Parameter: blocks.11.attn.to_qkv.weight, weights: nan
Parameter: blocks.11.attn.to_qkv.bias, weights: nan
Parameter: blocks.11.attn.to_out.weight, weights: nan
Parameter: blocks.11.attn.to_out.bias, weights: nan
Parameter: blocks.11.mlp.mlp.0.weight, weights: nan
Parameter: blocks.11.mlp.mlp.0.bias, weights: nan
Parameter: blocks.11.mlp.mlp.2.weight, weights: nan
Parameter: blocks.11.mlp.mlp.2.bias, weights: nan
Parameter: out_layer.weight, weights: -0.0008287686505354941
Parameter: out_layer.bias, weights: -0.02195589430630207
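
(For reference, the statistics above were printed with a loop roughly like the following; this is only a sketch, the exact logging code isn't shown in this issue.)

# Sketch: print gradient / weight statistics for every parameter.
for name, p in model.named_parameters():
    if p.grad is not None:
        print(f'{name} grad mean: {p.grad.mean()} dtype: {p.grad.dtype}')

# After optimizer.step():
for name, p in model.named_parameters():
    print(f'Parameter: {name}, weights: {p.data.mean().item()}')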
@ivand-all3d

I encountered a similar issue; check out this post on the PyTorch forums: https://discuss.pytorch.org/t/adam-half-precision-nans/1765.
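
The gist of that thread: the checkpoints you load are stored in fp16, and AdamW's default eps=1e-8 (as well as the tiny second-moment estimates) underflows in half precision, so the very first step can produce NaNs. A rough sketch of the usual workarounds (assuming the same `model`, `lr`, and `train_loader` as in your snippet):

# Option 1: keep master weights in fp32 and run the forward pass under autocast.
model.float()  # upcast the fp16 pretrained weights to fp32 for the optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
scaler = torch.cuda.amp.GradScaler()

for i, out in enumerate(train_loader):
    feats, coords = out['feats'].cuda(), out['coords'].cuda()
    renders, extr, intr = out['renders'].cuda(), out['extr'].cuda(), out['intr'].cuda()

    with torch.cuda.amp.autocast(dtype=torch.float16):
        loss = model(feats, coords, renders, extr, intr)['loss']

    optimizer.zero_grad()
    scaler.scale(loss).backward()  # scale the loss so fp16 grads do not underflow
    scaler.step(optimizer)         # unscales grads, skips the step if they are inf/nan
    scaler.update()

# Option 2: if the weights must stay fp16, raise eps so it is representable in
# half precision, e.g.:
# optimizer = torch.optim.AdamW(model.parameters(), lr=lr, eps=1e-4)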

@Rody1911641
Author

@ivand-all3d Thank you very much. Your solution is very helpful.
