I tried to write a simple script for training the SLat VAE, but after the first network parameter update, all of the parameter values became NaN.
import torch
from torch import autograd
from trellis import models

# Load the pretrained fp16 SLat encoder and Gaussian decoder
self.encoder = models.from_pretrained('JeffreyXiang/TRELLIS-image-large/ckpts/slat_enc_swin8_B_64l8_fp16')
self.decoder_gs = models.from_pretrained('JeffreyXiang/TRELLIS-image-large/ckpts/slat_dec_gs_swin8_B_64l8gs32_fp16')

lr = 1.0e-4
epochs = 100
optimizer = torch.optim.AdamW([
    {'params': model.encoder.parameters(), 'lr': lr},
    {'params': model.decoder_gs.parameters(), 'lr': lr},
])

for epoch in range(epochs):
    for i, out in enumerate(train_loader):
        with autograd.detect_anomaly():
            # Move the batch to the GPU
            feats = out['feats'].cuda()
            coords = out['coords'].cuda()
            renders = out['renders'].cuda()
            extr = out['extr'].cuda()
            intr = out['intr'].cuda()

            res = model(feats, coords, renders, extr, intr)
            loss = res['loss']

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
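The `_fp16` suffix on the checkpoint names and the `torch.float16` gradient dtypes in the log below suggest that AdamW is stepping fp16 parameters directly, which is a common cause of NaN: the default eps=1e-8 underflows to zero in fp16, and gradient means around 1e-7 sit near the fp16 resolution limit. A minimal sketch of the usual workaround (not the TRELLIS training code): keep master weights in fp32 and run the forward/backward pass under autocast with a GradScaler; `model`, `optimizer`, and `train_loader` are the objects from the script above.

# Cast the trainable modules to fp32 so the optimizer math is stable,
# and do the forward/backward in fp16 via AMP instead.
model.encoder.float()
model.decoder_gs.float()
scaler = torch.cuda.amp.GradScaler()

for epoch in range(epochs):
    for i, out in enumerate(train_loader):
        feats = out['feats'].cuda()
        coords = out['coords'].cuda()
        renders = out['renders'].cuda()
        extr = out['extr'].cuda()
        intr = out['intr'].cuda()

        # Run the model in fp16 only inside the autocast region
        with torch.cuda.amp.autocast(dtype=torch.float16):
            loss = model(feats, coords, renders, extr, intr)['loss']

        optimizer.zero_grad()
        scaler.scale(loss).backward()   # scale the loss so fp16 grads stay representable
        scaler.step(optimizer)          # unscales grads; skips the step if they are inf/NaN
        scaler.update()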
This is the gradient log for the first iteration, right after loss.backward():
blocks.11.attn.to_qkv.weight grad mean: tensor(-0., device='cuda:0', dtype=torch.float16) dtype: torch.float16
blocks.11.attn.to_qkv.bias grad mean: tensor(6.1989e-06, device='cuda:0', dtype=torch.float16) dtype: torch.float16
blocks.11.attn.to_out.weight grad mean: tensor(0., device='cuda:0', dtype=torch.float16) dtype: torch.float16
blocks.11.attn.to_out.bias grad mean: tensor(1.7881e-07, device='cuda:0', dtype=torch.float16) dtype: torch.float16
blocks.11.mlp.mlp.0.weight grad mean: tensor(-0., device='cuda:0', dtype=torch.float16) dtype: torch.float16
blocks.11.mlp.mlp.0.bias grad mean: tensor(1.3471e-05, device='cuda:0', dtype=torch.float16) dtype: torch.float16
blocks.11.mlp.mlp.2.weight grad mean: tensor(0., device='cuda:0', dtype=torch.float16) dtype: torch.float16
blocks.11.mlp.mlp.2.bias grad mean: tensor(1.7881e-07, device='cuda:0', dtype=torch.float16) dtype: torch.float16
out_layer.weight grad mean: tensor(-1.2083e-11, device='cuda:0') dtype: torch.float32
out_layer.bias grad mean: tensor(0.0001, device='cuda:0') dtype: torch.float32
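For reference, a log in this format can be produced with a loop like the following (a sketch; the exact logging code is not shown in the issue):

# Print the mean and dtype of each parameter's gradient after backward()
for name, p in model.named_parameters():
    if p.grad is not None:
        print(f"{name} grad mean: {p.grad.mean()} dtype: {p.grad.dtype}")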
This is the parameter log for the first iteration, right after optimizer.step():
Parameter: blocks.11.attn.to_qkv.weight, weights: nan
Parameter: blocks.11.attn.to_qkv.bias, weights: nan
Parameter: blocks.11.attn.to_out.weight, weights: nan
Parameter: blocks.11.attn.to_out.bias, weights: nan
Parameter: blocks.11.mlp.mlp.0.weight, weights: nan
Parameter: blocks.11.mlp.mlp.0.bias, weights: nan
Parameter: blocks.11.mlp.mlp.2.weight, weights: nan
Parameter: blocks.11.mlp.mlp.2.bias, weights: nan
Parameter: out_layer.weight, weights: -0.0008287686505354941
Parameter: out_layer.bias, weights: -0.02195589430630207
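A matching post-step check (again a sketch, not the original logging code) could look like:

# Flag parameters that became NaN after optimizer.step()
for name, p in model.named_parameters():
    has_nan = torch.isnan(p).any().item()
    print(f"Parameter: {name}, weights: {'nan' if has_nan else p.mean().item()}")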