Invalid gradients on Cuda backend using f32 above a certain layer size #2646

Open
ionspin opened this issue Dec 28, 2024 · 0 comments
Labels: bug (Something isn't working)

ionspin commented Dec 28, 2024

Describe the bug
When running a toy example with a single layer, zero weights, a zero input tensor, and a zero target tensor on the Cuda backend with f32, the returned gradients are invalid once the layer is larger than a certain size (it is not exactly clear what that limit is).

To Reproduce

  1. Run the supplied test with the NdArray backend.
  2. Observe the gradients; they are all zero.
  3. Run the supplied test with the Cuda backend using f32.
  4. Observe the gradients; they are not zero.
  5. Decrease the dimensions to 64*64 with an output of 16 and run with the Cuda backend.
  6. Observe the gradients; they are all zero, as expected.

Expected behavior
Gradients are always zero.
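
For context, with a bias-free linear layer, a zero input, and a zero target, the MSE gradient with respect to the weights should vanish identically (assuming Linear computes output = input.matmul(weight)):

dL/dW = 2 * x^T * (y_pred - y_target) = 0, since x = 0 and y_pred = y_target = 0.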

Additional context
At @nathanielsimard's suggestion I tried the Cuda backend with f16, and the gradients were as expected: all zeros!
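
For reference, a minimal sketch of the f16 variant (assuming burn re-exports half precision as burn::tensor::f16); only the backend alias changes:

use burn::tensor::f16;

// Half-precision Cuda backend; with this alias the gradients came back all zero.
pub type TestAutodiffBackend = burn_autodiff::Autodiff<Cuda<f16>>;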

Test:

use std::marker::PhantomData;
use burn::module::{AutodiffModule, ModuleVisitor, ParamId};
use burn::prelude::Tensor;
use burn::tensor::backend::AutodiffBackend;

pub struct TensorVisitor<'a, M: AutodiffModule<B>, B: AutodiffBackend> {
    grads: &'a mut B::Gradients,
    phantom: PhantomData<M>,
    filter: Option<Vec<ParamId>>,
}

impl<'a, M: AutodiffModule<B>, B: AutodiffBackend> TensorVisitor<'a, M, B> {
    pub fn new(grads: &'a mut B::Gradients) -> Self {
        Self {
            grads,
            phantom: PhantomData,
            filter: None,
        }
    }
}

impl<'a, B, M> ModuleVisitor<B> for TensorVisitor<'a, M, B>
where
    B: AutodiffBackend,
    M: AutodiffModule<B>,
{
    fn visit_float<const D: usize>(&mut self, id: ParamId, tensor: &Tensor<B, D>) {
        println!("Visiting {}, tensor {}", id, tensor);
        if let Some(filter) = self.filter.as_ref() {
            if !filter.contains(&id) {
                return;
            }
        }
        let Some(grad) = tensor.grad_remove(self.grads) else {
            return;
        };
        println!("Found grad {}", grad);
        println!("Grad sum {}", grad.clone().sum());

    }
}

#[cfg(test)]
#[cfg(target_arch = "x86_64")]
mod test {
    use burn::backend::cuda_jit::{Cuda, CudaDevice};
    use burn::module::Module;
    use burn::nn::loss::{MseLoss, Reduction};
    use burn::nn::Initializer::Zeros;
    use burn::nn::{Linear, LinearConfig};
    use burn::prelude::Tensor;

    use crate::reproducer::TensorVisitor;

    #[test]
    fn loss3() {
        // pub type TestAutodiffBackend = burn_autodiff::Autodiff<NdArray<f32>>;
        pub type TestAutodiffBackend = burn_autodiff::Autodiff<Cuda<f32>>;

        // let device = <NdArray as Backend>::Device::default();
        let device = CudaDevice::new(0);

        let width = 128;
        let height = 128;
        let output_dim = 64;

        let mut layer_config = LinearConfig::new(width * height, output_dim).with_bias(false);
        layer_config.initializer = Zeros;
        let layer: Linear<TestAutodiffBackend> = layer_config.init(&device);

        let output = layer.forward(Tensor::<TestAutodiffBackend, 2>::zeros(
            [1, width * height],
            &device,
        ));


        let target = Tensor::<TestAutodiffBackend, 2>::zeros([1, output_dim], &device);

        let loss = MseLoss::new().forward(output, target, Reduction::Sum);

        println!("Loss {}", loss.clone().into_scalar());

        let mut grads = loss.backward();

        let mut visitor =
            TensorVisitor::<Linear<TestAutodiffBackend>, TestAutodiffBackend>::new(&mut grads);
        layer.visit(&mut visitor);
    }
}

Result when running the above test (Cuda, 128*128, 64):

Loss 0
Visiting un58j9qgmsqos, tensor Tensor {
  data:
[[0.0, 0.0, 0.0, ..., 0.0, 0.0, 0.0],
 [0.0, 0.0, 0.0, ..., 0.0, 0.0, 0.0],
 [0.0, 0.0, 0.0, ..., 0.0, 0.0, 0.0],
 ...
 [0.0, 0.0, 0.0, ..., 0.0, 0.0, 0.0],
 [0.0, 0.0, 0.0, ..., 0.0, 0.0, 0.0],
 [0.0, 0.0, 0.0, ..., 0.0, 0.0, 0.0]],
  shape:  [16384, 64],
  device:  Cuda(0),
  backend:  "autodiff<fusion<jit<cuda>>>",
  kind:  "Float",
  dtype:  "f32",
}
Found grad Tensor {
  data:
[[-98.48454, -4.135821, 15.220333, ..., 12.137083, -13.435214, 67.603455],
 [-76.20355, -0.23300187, 10.8677225, ..., 10.043093, -8.470245, 51.768158],
 [48.57503, -61.13626, 11.851284, ..., -19.866186, -34.369072, -21.829111],
 ...
 [16.582195, -23.020058, 4.7044444, ..., -0.6991582, -26.60276, -41.37246],
 [13.174987, -18.290035, 3.7378035, ..., -10.988364, 0.31040558, 21.740341],
 [18.719637, -25.987339, 5.3108463, ..., -19.119335, 7.649512, 49.245087]],
  shape:  [16384, 64],
  device:  Cuda(0),
  backend:  "fusion<jit<cuda>>",
  kind:  "Float",
  dtype:  "f32",
}
Grad sum Tensor {
  data:
[5304.6475],
  shape:  [1],
  device:  Cuda(0),
  backend:  "fusion<jit<cuda>>",
  kind:  "Float",
  dtype:  "f32",
}

Result when running the above test with NdArray (NdArray, 128*128, 64):

Loss 0
Visiting c5rr8jcslsgua, tensor Tensor {
  data:
[[0.0, 0.0, 0.0, ..., 0.0, 0.0, 0.0],
 [0.0, 0.0, 0.0, ..., 0.0, 0.0, 0.0],
 [0.0, 0.0, 0.0, ..., 0.0, 0.0, 0.0],
 ...
 [0.0, 0.0, 0.0, ..., 0.0, 0.0, 0.0],
 [0.0, 0.0, 0.0, ..., 0.0, 0.0, 0.0],
 [0.0, 0.0, 0.0, ..., 0.0, 0.0, 0.0]],
  shape:  [16384, 64],
  device:  Cpu,
  backend:  "autodiff<ndarray>",
  kind:  "Float",
  dtype:  "f32",
}
Found grad Tensor {
  data:
[[0.0, 0.0, 0.0, ..., 0.0, 0.0, 0.0],
 [0.0, 0.0, 0.0, ..., 0.0, 0.0, 0.0],
 [0.0, 0.0, 0.0, ..., 0.0, 0.0, 0.0],
 ...
 [0.0, 0.0, 0.0, ..., 0.0, 0.0, 0.0],
 [0.0, 0.0, 0.0, ..., 0.0, 0.0, 0.0],
 [0.0, 0.0, 0.0, ..., 0.0, 0.0, 0.0]],
  shape:  [16384, 64],
  device:  Cpu,
  backend:  "ndarray",
  kind:  "Float",
  dtype:  "f32",
}
Grad sum Tensor {
  data:
[0.0],
  shape:  [1],
  device:  Cpu,
  backend:  "ndarray",
  kind:  "Float",
  dtype:  "f32",
}
laggui added the bug label on Jan 2, 2025