
The training result is blank #11

Open
wang674 opened this issue Nov 13, 2022 · 12 comments

Comments

@wang674

wang674 commented Nov 13, 2022

The training result is blank. [screenshot]

@ShuaiBai623
Collaborator

Are there screenshots of the training process?

@wang674
Author

wang674 commented Nov 14, 2022 via email

@kanthprashant

Hi @wang674 ,

Have you been able to solve this problem? I am encountering a similar issue while fine-tuning the model on a custom dataset. The model produces the expected output until epoch 6, but afterwards it begins to generate blank outputs.

@kanthprashant

The network is sensitive to weight initialisation and learning rate. If we use a proper learning rate from the start and keep the default weight initialisation, training works well.
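The sensitivity to learning rate can be illustrated even without the network: with plain fixed-step gradient descent on a toy quadratic (a stand-in for the real loss, not the SDAFNet objective), a small step size converges while a too-large one diverges.

```python
def gradient_descent(lr, steps=30, x0=1.0):
    """Minimise f(x) = x**2 with fixed-step gradient descent."""
    x = x0
    for _ in range(steps):
        grad = 2.0 * x          # f'(x) = 2x
        x = x - lr * grad       # each step multiplies x by (1 - 2*lr)
    return x

small = gradient_descent(lr=0.1)   # |x| shrinks by factor 0.8 per step
large = gradient_descent(lr=1.5)   # |x| doubles per step -> blows up

assert abs(small) < 1e-2
assert abs(large) > 1e3
```

The same mechanism is why a learning rate that is fine early in training can later push the weights into a regime where the outputs collapse.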

@hyyuan123

> The network is sensitive to weight initialisation and learning rate. If we use a proper learning rate from the start and keep the default weight initialisation, training works well.

May I ask whether the test results of your retrained model are good? I trained the model using the code and data provided by the author, but the test results were not good. [screenshot]

@kanthprashant

> May I ask whether the test results of your retrained model are good? I trained the model using the code and data provided by the author, but the test results were not good. [screenshot]

Hi @hyyuan123 ,
Yes, I was able to get reasonably good results using the same training code.

@hyyuan123

> Hi @hyyuan123 , Yes, I was able to get reasonably good results using the same training code.

@kanthprashant
Thank you for your reply. I'll try again.

@1BTU

1BTU commented Jun 3, 2023

> The training result is blank. [screenshot]

Hello, I had a similar problem. I used my own hardware to train starting from the weights the author provides on GitHub. However, after many rounds of training, even though I had reduced the learning rate to a very small value, the predicted result was still grey and white. Later I found that my code had two problems.

The first was in how the model was saved:

```python
torch.save(
    {
        "state_dict": sdafnet.state_dict(),
    },
    "savemodel.pt",
)
```

In my code I saved the model wrapped in a checkpoint dictionary, but at prediction time I loaded the file directly into the network with `net.load_state_dict()`, which led to inaccurate prediction results. I now save the weights like this instead:

```python
torch.save(net.state_dict(), save_path)
```
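The mismatch above comes down to one level of nesting: a checkpoint saved as a wrapping dictionary must be unwrapped before its contents reach `load_state_dict()`. A minimal sketch of the two save styles, using `pickle` in place of `torch.save`/`torch.load` so it runs without PyTorch (the file format differs, but the key-access logic is the same):

```python
import io
import pickle

# stand-in for sdafnet.state_dict(): a plain dict of named parameters
state_dict = {"layer.weight": [1.0, 2.0], "layer.bias": [0.5]}

# Style A (the original code): wrap the state dict in a checkpoint dict.
buf_a = io.BytesIO()
pickle.dump({"state_dict": state_dict}, buf_a)
buf_a.seek(0)
checkpoint = pickle.load(buf_a)
# Passing `checkpoint` straight to load_state_dict() would mismatch --
# the real weights live one level down, under the "state_dict" key:
weights_a = checkpoint["state_dict"]

# Style B (the fix): save the bare state dict, load it back directly.
buf_b = io.BytesIO()
pickle.dump(state_dict, buf_b)
buf_b.seek(0)
weights_b = pickle.load(buf_b)

assert weights_a == weights_b == state_dict
```

Either style works, as long as saving and loading agree: with style A the loader must unwrap `checkpoint["state_dict"]` before calling `net.load_state_dict()`.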

The second problem was this: I commented out `sdafnet = torch.nn.DataParallel(sdafnet, device_ids=range(torch.cuda.device_count()))` before saving the model, and the results became good, for the following reasons:

nn.DataParallel is a module used for parallel computing on multiple GPUs. It can replicate a model to multiple GPUs and execute forward and backward propagation of input data in parallel. If you only have one GPU, it is not necessary to use nn.DataParallel.

If you use nn.DataParallel in your code to load trained model weights and run prediction on a single GPU, it may produce inaccurate or unstable predictions, or even strange errors. This is because nn.DataParallel replicates the model across multiple GPUs during execution, and when predicting with only one GPU the results may differ from those of the original single-GPU model. Additionally, since nn.DataParallel splits the input into multiple smaller batches for processing, this can also affect the predictions.

The solution is to directly load the weights on a single GPU when loading the model weights, rather than loading the weights on multiple GPUs. If you need to train using multiple GPUs, you can use nn.DataParallel in the training code, but remove it when performing validation or testing, and only use a single GPU for prediction.
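A common symptom of this wrapper mismatch is in the state_dict keys themselves: `nn.DataParallel` stores the model under a `module` attribute, so every saved key carries a `module.` prefix, and loading such a checkpoint into an unwrapped model fails with missing/unexpected key errors. A minimal sketch of stripping the prefix before loading (the parameter names here are hypothetical):

```python
def strip_module_prefix(state_dict):
    """Remove the 'module.' prefix that nn.DataParallel adds to every key."""
    prefix = "module."
    return {
        (key[len(prefix):] if key.startswith(prefix) else key): value
        for key, value in state_dict.items()
    }

# keys as saved from a DataParallel-wrapped model (hypothetical layer names)
parallel_sd = {"module.conv1.weight": 0, "module.conv1.bias": 1}
plain_sd = strip_module_prefix(parallel_sd)
# plain_sd now has keys "conv1.weight" and "conv1.bias"
```

With PyTorch, the cleaner alternative is to save `sdafnet.module.state_dict()` from the wrapped model in the first place, so the checkpoint never contains the prefix.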

@xxxxl888

xxxxl888 commented Jun 5, 2023

Hello, may I ask: the model file saved during my training cannot be loaded, the training result is also blank, and the loss has not converged. What might the problem be?

@1BTU

1BTU commented Jun 18, 2023

> Hello, may I ask: the model file saved during my training cannot be loaded, the training result is also blank, and the loss has not converged. What might the problem be?

You can read my response above; maybe it can help you.

@xxxxl888

> You can read my response above; maybe it can help you.

Hello, thanks for the reminder. I'd like to ask another question: how should I compute the FID and SSIM scores? Which two datasets or sets of images should be used?

@xxxxl888

> You can read my response above; maybe it can help you.

Hello, would it be okay if I ask you a few questions? Is it normal for the first result image saved during training to be blank? Also, I noticed that the images you showed above had keypoint maps and clothing segmentation results, but my saved images don't seem to have those. Is there something I'm missing or doing wrong? I have looked at your solution, but it still doesn't clearly solve my problem.
