
The training result is blank #11

Open
wang674 opened this issue Nov 13, 2022 · 12 comments

Comments

@wang674

wang674 commented Nov 13, 2022

The training result is blank. [screenshot]

@ShuaiBai623
Collaborator

Are there screenshots of the training process?

@wang674
Author

wang674 commented Nov 14, 2022 via email

@kanthprashant

Hi @wang674 ,

Have you been able to solve this problem? I am encountering a similar issue while fine-tuning the model on a custom dataset. The model produces the expected output until epoch 6, but afterwards it begins to generate blank outputs.

@kanthprashant

The network is sensitive to weight initialisation and learning rate. If we use a proper learning rate from the start and keep the default weight initialisation, training works well.
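The sensitivity to learning rate can be illustrated even without the network: with plain fixed-step gradient descent on a toy quadratic (a stand-in for the real loss, not the SDAFNet objective), a small step size converges while a too-large one diverges.

```python
def gradient_descent(lr, steps=30, x0=1.0):
    """Minimise f(x) = x**2 with fixed-step gradient descent."""
    x = x0
    for _ in range(steps):
        grad = 2.0 * x          # f'(x) = 2x
        x = x - lr * grad       # each step multiplies x by (1 - 2*lr)
    return x

small = gradient_descent(lr=0.1)   # |x| shrinks by factor 0.8 per step
large = gradient_descent(lr=1.5)   # |x| doubles per step -> blows up

assert abs(small) < 1e-2
assert abs(large) > 1e3
```

The same mechanism is why a learning rate that is fine early in training can later push the weights into a regime where the outputs collapse.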

@hyyuan123

> The network is sensitive to weight initialisation and learning rate. If we use a proper learning rate from the start and keep the default weight initialisation, training works well.

May I ask whether the test results of your retrained model are good? I trained the model using the code and data provided by the author, but the test results were not good. [screenshot]

@kanthprashant

> May I ask whether the test results of your retrained model are good? I trained the model using the code and data provided by the author, but the test results were not good. [screenshot]

Hi @hyyuan123 ,
Yes, I was able to get reasonably good results using the same training code.

@hyyuan123

> Hi @hyyuan123 , Yes, I was able to get reasonably good results using the same training code.

@kanthprashant
Thank you for your reply. I'll try again.

@1BTU

1BTU commented Jun 3, 2023

> The training result is blank. [screenshot]

Hello, I had a similar problem. I used my own hardware to train starting from the weights the author provides on GitHub. However, after many rounds of training, even though I had reduced the learning rate to a very small value, the predicted result was still grey and white. Later I found that my code had two problems.

The first was in how the model was saved:

```python
torch.save(
    {
        "state_dict": sdafnet.state_dict(),
    },
    "savemodel.pt",
)
```

In my code I saved the model wrapped in a checkpoint dictionary, but at prediction time I loaded the file directly into the network with `net.load_state_dict()`, which led to inaccurate prediction results. I now save the weights like this instead:

```python
torch.save(net.state_dict(), save_path)
```
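The mismatch above comes down to one level of nesting: a checkpoint saved as a wrapping dictionary must be unwrapped before its contents reach `load_state_dict()`. A minimal sketch of the two save styles, using `pickle` in place of `torch.save`/`torch.load` so it runs without PyTorch (the file format differs, but the key-access logic is the same):

```python
import io
import pickle

# stand-in for sdafnet.state_dict(): a plain dict of named parameters
state_dict = {"layer.weight": [1.0, 2.0], "layer.bias": [0.5]}

# Style A (the original code): wrap the state dict in a checkpoint dict.
buf_a = io.BytesIO()
pickle.dump({"state_dict": state_dict}, buf_a)
buf_a.seek(0)
checkpoint = pickle.load(buf_a)
# Passing `checkpoint` straight to load_state_dict() would mismatch --
# the real weights live one level down, under the "state_dict" key:
weights_a = checkpoint["state_dict"]

# Style B (the fix): save the bare state dict, load it back directly.
buf_b = io.BytesIO()
pickle.dump(state_dict, buf_b)
buf_b.seek(0)
weights_b = pickle.load(buf_b)

assert weights_a == weights_b == state_dict
```

Either style works, as long as saving and loading agree: with style A the loader must unwrap `checkpoint["state_dict"]` before calling `net.load_state_dict()`.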

The second problem was this: I commented out `sdafnet = torch.nn.DataParallel(sdafnet, device_ids=range(torch.cuda.device_count()))` before saving the model, and the results became good, for the following reasons:

nn.DataParallel is a module used for parallel computing on multiple GPUs. It can replicate a model to multiple GPUs and execute forward and backward propagation of input data in parallel. If you only have one GPU, it is not necessary to use nn.DataParallel.

If you use nn.DataParallel in your code to load trained model weights and run prediction on a single GPU, it may produce inaccurate or unstable predictions, or even strange errors. This is because nn.DataParallel replicates the model across multiple GPUs during execution, and when predicting with only one GPU the results may differ from those of the original single-GPU model. Additionally, since nn.DataParallel splits the input into multiple smaller batches for processing, this can also affect the predictions.

The solution is to directly load the weights on a single GPU when loading the model weights, rather than loading the weights on multiple GPUs. If you need to train using multiple GPUs, you can use nn.DataParallel in the training code, but remove it when performing validation or testing, and only use a single GPU for prediction.
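A common symptom of this wrapper mismatch is in the state_dict keys themselves: `nn.DataParallel` stores the model under a `module` attribute, so every saved key carries a `module.` prefix, and loading such a checkpoint into an unwrapped model fails with missing/unexpected key errors. A minimal sketch of stripping the prefix before loading (the parameter names here are hypothetical):

```python
def strip_module_prefix(state_dict):
    """Remove the 'module.' prefix that nn.DataParallel adds to every key."""
    prefix = "module."
    return {
        (key[len(prefix):] if key.startswith(prefix) else key): value
        for key, value in state_dict.items()
    }

# keys as saved from a DataParallel-wrapped model (hypothetical layer names)
parallel_sd = {"module.conv1.weight": 0, "module.conv1.bias": 1}
plain_sd = strip_module_prefix(parallel_sd)
# plain_sd now has keys "conv1.weight" and "conv1.bias"
```

With PyTorch, the cleaner alternative is to save `sdafnet.module.state_dict()` from the wrapped model in the first place, so the checkpoint never contains the prefix.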

@xxxxl888

xxxxl888 commented Jun 5, 2023

Hello, may I ask: the model file saved during my training cannot be loaded, the training result is also blank, and the loss has not converged. What might the problem be?

@1BTU

1BTU commented Jun 18, 2023

> Hello, may I ask: the model file saved during my training cannot be loaded, the training result is also blank, and the loss has not converged. What might the problem be?

You can read my response above; maybe it can help you.

@xxxxl888

> You can read my response above; maybe it can help you.

Hello, thanks for the reminder. I'd like to ask another question: how should I compute the FID and SSIM scores? Which two datasets or sets of images should be used?

@xxxxl888

> You can read my response above; maybe it can help you.

Hello, would it be okay if I ask you a few questions? Is it normal for the first result image saved during training to be blank? Also, I noticed that the images you showed above had keypoint maps and clothing segmentation results, but my saved images don't seem to have those. Is there something I'm missing or doing wrong? I have looked at your solution, but it still doesn't clearly solve my problem.
