Invalid gradients on Cuda backend using f32 above a certain layer size #2646

Open
ionspin opened this issue Dec 28, 2024 · 0 comments
Labels: bug (Something isn't working)

ionspin commented Dec 28, 2024

Describe the bug
When running a toy example with a single layer, zero weights, a zero input tensor, and a zero target tensor on the Cuda backend with f32, the returned gradients are invalid once the layer is larger than a certain size (it is not exactly clear what that limit is).

To Reproduce

  1. Run the supplied test with the NdArray backend.
  2. Observe the gradients; they are all zero.
  3. Run the supplied test with the Cuda backend using f32.
  4. Observe the gradients; they are not zero.
  5. Decrease the dimensions to 64*64 with an output of 16 and run with the Cuda backend.
  6. Observe the gradients; they are all zero, as expected.

Expected behavior
Gradients are always zero.
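
For context, with a bias-free linear layer, a zero input, and a zero target, the MSE gradient with respect to the weights should vanish identically (assuming Linear computes output = input.matmul(weight)):

dL/dW = 2 * x^T * (y_pred - y_target) = 0, since x = 0 and y_pred = y_target = 0.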

Additional context
At @nathanielsimard's suggestion I tried the Cuda backend with f16, and the gradients were as expected: all zeros!
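
For reference, a minimal sketch of the f16 variant (assuming burn re-exports half precision as burn::tensor::f16); only the backend alias changes:

use burn::tensor::f16;

// Half-precision Cuda backend; with this alias the gradients came back all zero.
pub type TestAutodiffBackend = burn_autodiff::Autodiff<Cuda<f16>>;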

Test:

use std::marker::PhantomData;
use burn::module::{AutodiffModule, ModuleVisitor, ParamId};
use burn::prelude::Tensor;
use burn::tensor::backend::AutodiffBackend;

pub struct TensorVisitor<'a, M: AutodiffModule<B>, B: AutodiffBackend> {
    grads: &'a mut B::Gradients,
    phantom: PhantomData<M>,
    filter: Option<Vec<ParamId>>,
}

impl<'a, M: AutodiffModule<B>, B: AutodiffBackend> TensorVisitor<'a, M, B> {
    pub fn new(grads: &'a mut B::Gradients) -> Self {
        Self {
            grads,
            phantom: PhantomData,
            filter: None,
        }
    }
}

impl<'a, B, M> ModuleVisitor<B> for TensorVisitor<'a, M, B>
where
    B: AutodiffBackend,
    M: AutodiffModule<B>,
{
    fn visit_float<const D: usize>(&mut self, id: ParamId, tensor: &Tensor<B, D>) {
        println!("Visiting {}, tensor {}", id, tensor);
        if let Some(filter) = self.filter.as_ref() {
            if !filter.contains(&id) {
                return;
            }
        }
        let Some(grad) = tensor.grad_remove(self.grads) else {
            return;
        };
        println!("Found grad {}", grad);
        println!("Grad sum {}", grad.clone().sum());

    }
}

#[cfg(test)]
#[cfg(target_arch = "x86_64")]
mod test {
    use burn::backend::cuda_jit::{Cuda, CudaDevice};
    use burn::module::Module;
    use burn::nn::loss::{MseLoss, Reduction};
    use burn::nn::Initializer::Zeros;
    use burn::nn::{Linear, LinearConfig};
    use burn::prelude::Tensor;

    use crate::reproducer::TensorVisitor;

    #[test]
    fn loss3() {
        // pub type TestAutodiffBackend = burn_autodiff::Autodiff<NdArray<f32>>;
        pub type TestAutodiffBackend = burn_autodiff::Autodiff<Cuda<f32>>;

        // let device = <NdArray as Backend>::Device::default();
        let device = CudaDevice::new(0);

        let width = 128;
        let height = 128;
        let output_dim = 64;

        let mut layer_config = LinearConfig::new(width * height, output_dim).with_bias(false);
        layer_config.initializer = Zeros;
        let layer: Linear<TestAutodiffBackend> = layer_config.init(&device);

        let output = layer.forward(Tensor::<TestAutodiffBackend, 2>::zeros(
            [1, width * height],
            &device,
        ));


        let target = Tensor::<TestAutodiffBackend, 2>::zeros([1, output_dim], &device);

        let loss = MseLoss::new().forward(output, target, Reduction::Sum);

        println!("Loss {}", loss.clone().into_scalar());

        let mut grads = loss.backward();

        let mut visitor =
            TensorVisitor::<Linear<TestAutodiffBackend>, TestAutodiffBackend>::new(&mut grads);
        layer.visit(&mut visitor);
    }
}

Result when running the above test (Cuda, 128*128, 64):

Loss 0
Visiting un58j9qgmsqos, tensor Tensor {
  data:
[[0.0, 0.0, 0.0, ..., 0.0, 0.0, 0.0],
 [0.0, 0.0, 0.0, ..., 0.0, 0.0, 0.0],
 [0.0, 0.0, 0.0, ..., 0.0, 0.0, 0.0],
 ...
 [0.0, 0.0, 0.0, ..., 0.0, 0.0, 0.0],
 [0.0, 0.0, 0.0, ..., 0.0, 0.0, 0.0],
 [0.0, 0.0, 0.0, ..., 0.0, 0.0, 0.0]],
  shape:  [16384, 64],
  device:  Cuda(0),
  backend:  "autodiff<fusion<jit<cuda>>>",
  kind:  "Float",
  dtype:  "f32",
}
Found grad Tensor {
  data:
[[-98.48454, -4.135821, 15.220333, ..., 12.137083, -13.435214, 67.603455],
 [-76.20355, -0.23300187, 10.8677225, ..., 10.043093, -8.470245, 51.768158],
 [48.57503, -61.13626, 11.851284, ..., -19.866186, -34.369072, -21.829111],
 ...
 [16.582195, -23.020058, 4.7044444, ..., -0.6991582, -26.60276, -41.37246],
 [13.174987, -18.290035, 3.7378035, ..., -10.988364, 0.31040558, 21.740341],
 [18.719637, -25.987339, 5.3108463, ..., -19.119335, 7.649512, 49.245087]],
  shape:  [16384, 64],
  device:  Cuda(0),
  backend:  "fusion<jit<cuda>>",
  kind:  "Float",
  dtype:  "f32",
}
Grad sum Tensor {
  data:
[5304.6475],
  shape:  [1],
  device:  Cuda(0),
  backend:  "fusion<jit<cuda>>",
  kind:  "Float",
  dtype:  "f32",
}

Result when running the above test with NdArray (NdArray, 128*128, 64):

Loss 0
Visiting c5rr8jcslsgua, tensor Tensor {
  data:
[[0.0, 0.0, 0.0, ..., 0.0, 0.0, 0.0],
 [0.0, 0.0, 0.0, ..., 0.0, 0.0, 0.0],
 [0.0, 0.0, 0.0, ..., 0.0, 0.0, 0.0],
 ...
 [0.0, 0.0, 0.0, ..., 0.0, 0.0, 0.0],
 [0.0, 0.0, 0.0, ..., 0.0, 0.0, 0.0],
 [0.0, 0.0, 0.0, ..., 0.0, 0.0, 0.0]],
  shape:  [16384, 64],
  device:  Cpu,
  backend:  "autodiff<ndarray>",
  kind:  "Float",
  dtype:  "f32",
}
Found grad Tensor {
  data:
[[0.0, 0.0, 0.0, ..., 0.0, 0.0, 0.0],
 [0.0, 0.0, 0.0, ..., 0.0, 0.0, 0.0],
 [0.0, 0.0, 0.0, ..., 0.0, 0.0, 0.0],
 ...
 [0.0, 0.0, 0.0, ..., 0.0, 0.0, 0.0],
 [0.0, 0.0, 0.0, ..., 0.0, 0.0, 0.0],
 [0.0, 0.0, 0.0, ..., 0.0, 0.0, 0.0]],
  shape:  [16384, 64],
  device:  Cpu,
  backend:  "ndarray",
  kind:  "Float",
  dtype:  "f32",
}
Grad sum Tensor {
  data:
[0.0],
  shape:  [1],
  device:  Cpu,
  backend:  "ndarray",
  kind:  "Float",
  dtype:  "f32",
}
laggui added the bug label on Jan 2, 2025