training loss is 'nan' #75

Closed
008karan opened this issue Nov 15, 2019 · 4 comments

@008karan

In training I am getting only NaN loss. Even after 40 epochs there is no update.

epoch 0, loss_tr=nan err_tr=1.000000 loss_te=nan err_te=1.000000 err_te_snt=1.000000
epoch 8, loss_tr=nan err_tr=1.000000 loss_te=nan err_te=1.000000 err_te_snt=1.000000
epoch 16, loss_tr=nan err_tr=1.000000 loss_te=nan err_te=1.000000 err_te_snt=1.000000
epoch 24, loss_tr=nan err_tr=1.000000 loss_te=nan err_te=1.000000 err_te_snt=1.000000
epoch 32, loss_tr=nan err_tr=1.000000 loss_te=nan err_te=1.000000 err_te_snt=1.000000
epoch 40, loss_tr=nan err_tr=1.000000 loss_te=nan err_te=1.000000 err_te_snt=1.000000
epoch 0, loss_tr=nan err_tr=0.998643 loss_te=nan err_te=0.998567 err_te_snt=0.998567
epoch 8, loss_tr=nan err_tr=0.998555 loss_te=nan err_te=0.998567 err_te_snt=0.998567
epoch 16, loss_tr=nan err_tr=0.998525 loss_te=nan err_te=0.998567 err_te_snt=0.998567
epoch 24, loss_tr=nan err_tr=0.998604 loss_te=nan err_te=0.998567 err_te_snt=0.998567
epoch 32, loss_tr=nan err_tr=0.998457 loss_te=nan err_te=0.998567 err_te_snt=0.998567
epoch 40, loss_tr=nan err_tr=0.998613 loss_te=nan err_te=0.998567 err_te_snt=0.998567

After debugging, I found that the training batches are being generated (I printed their tensors), but when I print pout it is NaN:

    for i in range(N_batches):
        # build a random batch of speech chunks and labels, then run the forward pass
        [inp, lab] = create_batches_rnd(batch_size, data_folder, wav_lst_tr, snt_tr, wlen, lab_dict, 0.2)
        pout = DNN2_net(DNN1_net(CNN_net(inp)))

        pred = torch.max(pout, dim=1)[1]
        loss = cost(pout, lab.long())
        err = torch.mean((pred != lab.long()).float())
        print('***********', pout)

output:

        [-6.4934, -6.5646, -6.5842,  ..., -6.5785, -6.5211, -6.5626],
        [-6.5141, -6.5833, -6.5316,  ..., -6.6234, -6.4951, -6.5934],
        ...,
        [-6.5460, -6.5525, -6.5630,  ..., -6.5581, -6.5142, -6.5896],
        [-6.4957, -6.5235, -6.4879,  ..., -6.6145, -6.5316, -6.6193],
        [-6.5091, -6.5749, -6.5799,  ..., -6.5857, -6.5860, -6.6161]],
       device='cuda:0', grad_fn=<LogSoftmaxBackward>)
*********** tensor([[nan, nan, nan,  ..., nan, nan, nan],
        [nan, nan, nan,  ..., nan, nan, nan],
        [nan, nan, nan,  ..., nan, nan, nan],
        ...,
        [nan, nan, nan,  ..., nan, nan, nan],
        [nan, nan, nan,  ..., nan, nan, nan],
        [nan, nan, nan,  ..., nan, nan, nan]], device='cuda:0',
       grad_fn=<LogSoftmaxBackward>)
*********** tensor([[nan, nan, nan,  ..., nan, nan, nan],
        [nan, nan, nan,  ..., nan, nan, nan],
        [nan, nan, nan,  ..., nan, nan, nan],
        ...,
        [nan, nan, nan,  ..., nan, nan, nan],
        [nan, nan, nan,  ..., nan, nan, nan],
        [nan, nan, nan,  ..., nan, nan, nan]], device='cuda:0',
       grad_fn=<LogSoftmaxBackward>)

For the first batch it prints a valid tensor, but after that it is all NaN. Very weird behaviour.
My dataset contains 10 audio files of 10 s length for each speaker.
Please help!
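
One way to narrow down which stage first produces the NaNs is to check the input batch and each module's output, e.g. with a hypothetical helper like the sketch below (debugging code written for illustration, not taken from the repo):

    import torch

    # Hypothetical debugging helper: report whether a tensor contains NaNs.
    def check_nan(name, tensor):
        if torch.isnan(tensor).any():
            print('NaN detected in', name)
        return tensor

    with torch.no_grad():
        check_nan('input batch', inp)
        out_cnn = check_nan('CNN_net output', CNN_net(inp))
        out_dnn1 = check_nan('DNN1_net output', DNN1_net(out_cnn))
        check_nan('DNN2_net output (pout)', DNN2_net(out_dnn1))

    # Alternatively, make autograd raise an error at the first backward op that produces NaNs:
    torch.autograd.set_detect_anomaly(True)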

@008karan
Author

When I increased the chunk size from 200 ms to 600 ms, training started correctly. Below 600 ms it still gives NaN, and above 700 ms the range for picking a random chunk becomes zero because snt_len becomes less than wlen.
Can you tell why, @mravanelli?
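
For context, random chunk selection is typically done along the lines of the sketch below (an illustrative reconstruction, not the exact create_batches_rnd code); when snt_len is not larger than wlen, the range for the random start index collapses:

    import numpy as np

    # Illustrative sketch: pick a random chunk of wlen samples from a waveform `signal`.
    # `signal`, `wlen` and `snt_len` are assumed to correspond to the variables in create_batches_rnd.
    snt_len = signal.shape[0]
    snt_beg = np.random.randint(snt_len - wlen - 1)  # raises ValueError once snt_len - wlen - 1 <= 0
    snt_end = snt_beg + wlen
    chunk = signal[snt_beg:snt_end]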

@mravanelli
Owner

It looks like a gradient issue. Sometimes one can solve these issues just by adding gradient clipping.
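
For example, a minimal sketch of gradient-norm clipping in the training loop (the max_norm value is an arbitrary choice, and the per-module optimizer names are assumptions):

    import torch

    loss.backward()

    # Clip the gradient norm of each module before stepping the optimizers.
    torch.nn.utils.clip_grad_norm_(CNN_net.parameters(), max_norm=5.0)
    torch.nn.utils.clip_grad_norm_(DNN1_net.parameters(), max_norm=5.0)
    torch.nn.utils.clip_grad_norm_(DNN2_net.parameters(), max_norm=5.0)

    optimizer_CNN.step()   # assumed per-module optimizers
    optimizer_DNN1.step()
    optimizer_DNN2.step()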

@AnnaWang288

I encountered the same problem; how did you solve it in the end?

@natank1

natank1 commented Nov 27, 2020

I encountered this. See #102.
