Problem with netDM1 training with multiple GPUs #25

Open
JiamingSuen opened this issue Oct 11, 2017 · 4 comments
@JiamingSuen

Multi-GPU training with netFlow1 works fine for me, but when training the next evolution (netDM1) some gradient tensors come back as None.
I'm running on a machine with 4 GPUs but only expose two of them to the program via CUDA_VISIBLE_DEVICES=2,3. I noticed that get_gpu_count() still returns 4 even though only two GPUs are visible. I tried setting _num_gpus = 2 manually, but that didn't solve the problem.
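For reference, a minimal sketch of how a visible-GPU count could be obtained so that it respects CUDA_VISIBLE_DEVICES (the helper name count_visible_gpus is hypothetical and not part of this repo; tfutils' get_gpu_count may work differently):

```python
# Hypothetical replacement for get_gpu_count(): counts only the GPUs that
# TensorFlow can actually see, i.e. it respects CUDA_VISIBLE_DEVICES.
from tensorflow.python.client import device_lib

def count_visible_gpus():
    # list_local_devices() enumerates the devices of the local process,
    # so GPUs hidden via CUDA_VISIBLE_DEVICES do not show up here.
    return len([d for d in device_lib.list_local_devices()
                if d.device_type == 'GPU'])
```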

BTW: Is there a reason you don't divide the loss by the batch size, or is this just another valid way of handling loss and batch size?

Thanks!

➜  v2 git:(devel) ✗ CUDA_VISIBLE_DEVICES=2,3 ipython training-testDM-multiGPU.py
Using /home/jiamingsun/Repositories/demon/build/multivih5datareaderop/multivih5datareaderop.so
Using /home/jiamingsun/Repositories/demon/lmbspecialops/build/lib/lmbspecialops.so
2017-10-11 17:01:19.298226: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
2017-10-11 17:01:19.298265: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2017-10-11 17:01:19.298273: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2017-10-11 17:01:19.298281: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
2017-10-11 17:01:19.298288: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
2017-10-11 17:01:19.875586: I tensorflow/core/common_runtime/gpu/gpu_device.cc:887] Found device 0 with properties:
name: TITAN X (Pascal)
major: 6 minor: 1 memoryClockRate (GHz) 1.531
pciBusID 0000:83:00.0
Total memory: 11.90GiB
Free memory: 11.75GiB
2017-10-11 17:01:19.875754: W tensorflow/stream_executor/cuda/cuda_driver.cc:485] creating context when one is currently active; existing: 0x3e92220
2017-10-11 17:01:20.206920: I tensorflow/core/common_runtime/gpu/gpu_device.cc:887] Found device 1 with properties:
name: TITAN X (Pascal)
major: 6 minor: 1 memoryClockRate (GHz) 1.531
pciBusID 0000:84:00.0
Total memory: 11.90GiB
Free memory: 11.75GiB
2017-10-11 17:01:20.208079: I tensorflow/core/common_runtime/gpu/gpu_device.cc:908] DMA: 0 1
2017-10-11 17:01:20.208094: I tensorflow/core/common_runtime/gpu/gpu_device.cc:918] 0:   Y Y
2017-10-11 17:01:20.208100: I tensorflow/core/common_runtime/gpu/gpu_device.cc:918] 1:   Y Y
2017-10-11 17:01:20.208119: I tensorflow/core/common_runtime/gpu/gpu_device.cc:977] Creating TensorFlow device (/gpu:0) -> (device: 0, name: TITAN X (Pascal), pci bus id: 0000:83:00.0)
2017-10-11 17:01:20.208128: I tensorflow/core/common_runtime/gpu/gpu_device.cc:977] Creating TensorFlow device (/gpu:1) -> (device: 1, name: TITAN X (Pascal), pci bus id: 0000:84:00.0)
Tensor("tower_0/netFlow1/dense5/LeakyRelu:0", shape=(8, 4608), dtype=float32, device=/device:GPU:0)
Tensor("tower_0/netDM1/dense5/LeakyRelu:0", shape=(8, 4608), dtype=float32, device=/device:GPU:0)
[<tf.Tensor 'tower_0/gradients/tower_0/netDM1/predict_depthnormal2/split_grad/concat:0' shape=(8, 4, 48, 64) dtype=float32>, None, None]
[<tf.Tensor 'tower_0/gradients/tower_0/netDM1/split_grad/concat:0' shape=(8, 7) dtype=float32>, None, None]
Tensor("tower_1/netFlow1/dense5/LeakyRelu:0", shape=(8, 4608), dtype=float32, device=/device:GPU:1)
Tensor("tower_1/netDM1/dense5/LeakyRelu:0", shape=(8, 4608), dtype=float32, device=/device:GPU:1)
[<tf.Tensor 'tower_1/gradients/tower_1/netDM1/predict_depthnormal2/split_grad/concat:0' shape=(8, 4, 48, 64) dtype=float32>, None, None]
[<tf.Tensor 'tower_1/gradients/tower_1/netDM1/split_grad/concat:0' shape=(8, 7) dtype=float32>, None, None]
Tensor("tower_2/netFlow1/dense5/LeakyRelu:0", shape=(8, 4608), dtype=float32, device=/device:GPU:2)
Tensor("tower_2/netDM1/dense5/LeakyRelu:0", shape=(8, 4608), dtype=float32, device=/device:GPU:2)
[<tf.Tensor 'tower_2/gradients/tower_2/netDM1/predict_depthnormal2/split_grad/concat:0' shape=(8, 4, 48, 64) dtype=float32>, None, None]
[<tf.Tensor 'tower_2/gradients/tower_2/netDM1/split_grad/concat:0' shape=(8, 7) dtype=float32>, None, None]
Tensor("tower_3/netFlow1/dense5/LeakyRelu:0", shape=(8, 4608), dtype=float32, device=/device:GPU:3)
Tensor("tower_3/netDM1/dense5/LeakyRelu:0", shape=(8, 4608), dtype=float32, device=/device:GPU:3)
[<tf.Tensor 'tower_3/gradients/tower_3/netDM1/predict_depthnormal2/split_grad/concat:0' shape=(8, 4, 48, 64) dtype=float32>, None, None]
[<tf.Tensor 'tower_3/gradients/tower_3/netDM1/split_grad/concat:0' shape=(8, 7) dtype=float32>, None, None]
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/home/jiamingsun/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py in apply_op(self, op_type_name, name, **keywords)
    490                 as_ref=input_arg.is_ref,
--> 491                 preferred_dtype=default_dtype)
    492           except TypeError as err:

/home/jiamingsun/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/ops.py in internal_convert_to_tensor(value, dtype, name, as_ref, preferred_dtype)
    703         if ret is None:
--> 704           ret = conversion_func(value, dtype=dtype, name=name, as_ref=as_ref)
    705

/home/jiamingsun/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/constant_op.py in _constant_tensor_conversion_function(v, dtype, name, as_ref)
    112   _ = as_ref
--> 113   return constant(v, dtype=dtype, name=name)
    114

/home/jiamingsun/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/constant_op.py in constant(value, dtype, shape, name, verify_shape)
    101   tensor_value.tensor.CopyFrom(
--> 102       tensor_util.make_tensor_proto(value, dtype=dtype, shape=shape, verify_shape=verify_shape))
    103   dtype_value = attr_value_pb2.AttrValue(type=tensor_value.tensor.dtype)

/home/jiamingsun/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/tensor_util.py in make_tensor_proto(values, dtype, shape, verify_shape)
    359     if values is None:
--> 360       raise ValueError("None values not supported.")
    361     # if dtype is provided, forces numpy array to be the type

ValueError: None values not supported.

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
/home/jiamingsun/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py in apply_op(self, op_type_name, name, **keywords)
    504               observed = ops.internal_convert_to_tensor(
--> 505                   values, as_ref=input_arg.is_ref).dtype.name
    506             except ValueError as err:

/home/jiamingsun/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/ops.py in internal_convert_to_tensor(value, dtype, name, as_ref, preferred_dtype)
    703         if ret is None:
--> 704           ret = conversion_func(value, dtype=dtype, name=name, as_ref=as_ref)
    705

/home/jiamingsun/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/constant_op.py in _constant_tensor_conversion_function(v, dtype, name, as_ref)
    112   _ = as_ref
--> 113   return constant(v, dtype=dtype, name=name)
    114

/home/jiamingsun/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/constant_op.py in constant(value, dtype, shape, name, verify_shape)
    101   tensor_value.tensor.CopyFrom(
--> 102       tensor_util.make_tensor_proto(value, dtype=dtype, shape=shape, verify_shape=verify_shape))
    103   dtype_value = attr_value_pb2.AttrValue(type=tensor_value.tensor.dtype)

/home/jiamingsun/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/tensor_util.py in make_tensor_proto(values, dtype, shape, verify_shape)
    359     if values is None:
--> 360       raise ValueError("None values not supported.")
    361     # if dtype is provided, forces numpy array to be the type

ValueError: None values not supported.

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
/home/jiamingsun/Repositories/demon/training/v2/training-testDM-multiGPU.py in <module>()
    619
    620 if __name__ == '__main__':
--> 621     tf.app.run()
    622
    623

/home/jiamingsun/anaconda3/lib/python3.6/site-packages/tensorflow/python/platform/app.py in run(main, argv)
     46   # Call the main function, passing through any arguments
     47   # to the final program.
---> 48   _sys.exit(main(_sys.argv[:1] + flags_passthrough))
     49
     50

/home/jiamingsun/Repositories/demon/training/v2/training-testDM-multiGPU.py in main(argv)
    558
    559     # combine gradients from all towers
--> 560     avg_grads = average_gradients(tower_grads)
    561
    562     optimize_op = optimizer.apply_gradients(grads_and_vars=avg_grads, global_step=trainer.global_step())

/home/jiamingsun/Repositories/demon/python/tfutils/helpers.py in average_gradients(tower_grads)
    276     for g, _ in grad_and_vars:
    277       # Add 0 dimension to the gradients to represent the tower.
--> 278       expanded_g = tf.expand_dims(g, 0)
    279
    280       # Append on a 'tower' dimension which we will average over below.

/home/jiamingsun/anaconda3/lib/python3.6/site-packages/tensorflow/python/ops/array_ops.py in expand_dims(input, axis, name, dim)
    168       raise ValueError("can't specify both 'dim' and 'axis'")
    169     axis = dim
--> 170   return gen_array_ops._expand_dims(input, axis, name)
    171 # pylint: enable=redefined-builtin,protected-access
    172

/home/jiamingsun/anaconda3/lib/python3.6/site-packages/tensorflow/python/ops/gen_array_ops.py in _expand_dims(input, dim, name)
    898     dimension of size 1 added.
    899   """
--> 900   result = _op_def_lib.apply_op("ExpandDims", input=input, dim=dim, name=name)
    901   return result
    902

/home/jiamingsun/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py in apply_op(self, op_type_name, name, **keywords)
    507               raise ValueError(
    508                   "Tried to convert '%s' to a tensor and failed. Error: %s" %
--> 509                   (input_name, err))
    510             prefix = ("Input '%s' of '%s' Op has type %s that does not match" %
    511                       (input_name, op_type_name, observed))

ValueError: Tried to convert 'input' to a tensor and failed. Error: None values not supported.
@JiamingSuen
Author

BTW, it would be better to point the submodule lmbspecialops @ 06fd1a4 to the latest commit so that it picks up the changes in flow_to_depth2.
What kind of bug was fixed in flow_to_depth2 exactly? Does it mean that networks trained with the deprecated flow_to_depth are no longer valid at all?

@benjaminum
Collaborator

BTW, it would be better to point the submodule lmbspecialops @ 06fd1a4 to the latest commit so that it picks up the changes in flow_to_depth2.

Thanks! I updated the submodule

What kind of bug was fixed in flow_to_depth2 exactly? Does it mean that networks trained with the deprecated flow_to_depth are no longer valid at all?

The depth values returned by flow_to_depth were wrong (the normalization of some parameters was incorrect), but the networks are able to adapt to these wrong values.
We will keep flow_to_depth for the networks that have been trained with it.
For new networks trained from scratch I recommend using the fixed version.

I don't have a solution for the multi-GPU problem yet, sorry.

@JiamingSuen
Author

Thanks for your reply. For the multi-GPU problem, if you don't have time at the moment, could you give me a guess at what might be wrong so I have a place to start? Could the problem be in tfutils? Do you think end-to-end training is possible with multi-GPU model parallelization?
And what about the loss calculation: is there a reason you don't divide the loss by the batch size, or is this just another valid way of handling loss and batch size?

@benjaminum
Collaborator

I think the problem could be in tfutils, in the average_gradients or combine_loss_dicts function.
Unfortunately, I don't have time right now to investigate.
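
For what it's worth, a minimal sketch of a tower-gradient averaging loop that skips None gradients (variables a tower never touches), assuming the same list-of-(grad, var)-lists layout that tfutils' average_gradients uses; this is not the repository's implementation, just an illustration of where the traceback points:

```python
import tensorflow as tf

def average_gradients_skip_none(tower_grads):
    """tower_grads: one list of (gradient, variable) pairs per tower."""
    averaged = []
    for grad_and_vars in zip(*tower_grads):
        # Gradients can be None for variables that the current evolution does
        # not train; tf.expand_dims(None, 0) is exactly what raises
        # "None values not supported" in the traceback, so skip those entries.
        grads = [tf.expand_dims(g, 0) for g, _ in grad_and_vars if g is not None]
        if not grads:
            continue
        grad = tf.reduce_mean(tf.concat(grads, axis=0), axis=0)
        var = grad_and_vars[0][1]  # the variable is shared across towers
        averaged.append((grad, var))
    return averaged
```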

You are right, the L1 loss for the motion is not normalized with respect to the batch size; this is an oversight.
For the different towers we average the losses in the combine_loss_dicts function.
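
As an illustration only (the tensor names motion_pred and motion_gt are made up, not taken from the training scripts, and a [batch_size, 7] motion vector is assumed), normalizing the L1 motion loss by the batch size would amount to:

```python
import tensorflow as tf

# motion_pred, motion_gt: [batch_size, 7] tensors.
# tf.reduce_mean divides by the total number of elements, which includes the
# batch dimension, so the loss no longer scales with the batch size.
motion_l1_loss = tf.reduce_mean(tf.abs(motion_pred - motion_gt))
# The unnormalized variant would use tf.reduce_sum(...) instead, summing over
# the batch as well.
```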

Multi-GPU training for all evolutions should be possible.
