Problem with netDM1 training with multiple GPUs #25
Comments
BTW, it's better to change the submodule.
Thanks! I updated the submodule.
The depth values returned by …
I don't have a solution for the multi-GPU problem yet, sorry.
Thanks for your reply. For the multi-GPU problem, could you give me a guess at what might be wrong so I have something to start with, in case you don't have time recently? Could it be a problem in tfutils? Do you think end-to-end training is possible with multi-GPU model parallelization?
I think the problem could be in tfutils, in the …
You are right, the L1 loss for the motion is not normalized with respect to the batch size; this is an oversight. Multi-GPU training for all evolutions should be possible.
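For reference, here is a minimal sketch of an L1 motion loss that is normalized by the batch size, assuming TensorFlow 1.x-style graph code and a [batch_size, 6] motion vector (rotation + translation); the function and tensor names are illustrative, not the actual DeMoN/tfutils implementation:

```python
import tensorflow as tf

# Illustrative sketch, not the actual DeMoN/tfutils loss code.
# motion_pred, motion_gt: [batch_size, 6] camera motion (rotation + translation).
def l1_motion_loss(motion_pred, motion_gt):
    abs_diff = tf.abs(motion_pred - motion_gt)
    # reduce_sum alone grows linearly with the batch size;
    # dividing by batch_size keeps the loss magnitude independent
    # of how many samples are in the batch.
    batch_size = tf.cast(tf.shape(motion_pred)[0], tf.float32)
    return tf.reduce_sum(abs_diff) / batch_size
```

Using tf.reduce_mean instead would additionally average over the 6 motion components; either way, the loss scale no longer changes when the batch size changes.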
Multi-GPU training with netFlow1 works fine for me, while for the next evolution's training some tensors show up as None values. I'm running on a machine with 4 GPUs installed, but only expose two of them to the program via CUDA_VISIBLE_DEVICES=2,3. I noticed that get_gpu_count() still returned 4 even though only two GPUs are visible. I tried setting num_gpus = 2 manually, but this didn't solve the problem.
BTW: Did you leave the loss undivided by the batch size on purpose, or is this just another normal way of dealing with loss and batch size?
Thanks!
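As a possible workaround for the device count question above, one way to count only the GPUs that CUDA_VISIBLE_DEVICES exposes is to consult the environment variable (or TensorFlow's own device list) instead of querying the machine directly. A rough sketch; the function name visible_gpu_count is made up here and is not part of tfutils:

```python
import os
from tensorflow.python.client import device_lib

def visible_gpu_count():
    """Count GPUs as seen after CUDA_VISIBLE_DEVICES has been applied.
    Note: list_local_devices() initializes the visible GPUs."""
    visible = os.environ.get('CUDA_VISIBLE_DEVICES')
    if visible is not None:
        # An empty string means no GPUs are visible.
        return len([d for d in visible.split(',') if d.strip() != ''])
    # Fall back to the devices TensorFlow can actually see.
    return len([d for d in device_lib.list_local_devices()
                if d.device_type == 'GPU'])
```

With CUDA_VISIBLE_DEVICES=2,3 this returns 2, which could be used in place of the 4 that get_gpu_count() reports for the physical GPUs.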