ModuleNotFoundError: No module named 'flash_attn.models.falcon' #22

Open
Sniper970119 opened this issue Dec 26, 2023 · 11 comments

@Sniper970119

Sniper970119 commented Dec 26, 2023

I ran bash scripts/setup_flash.sh without errors (though it finished in just a few minutes):

[screenshot of the setup_flash.sh output]

But I got an error message when I ran bash scripts/run_pile.sh:

Traceback (most recent call last):
  File "doremi/train.py", line 56, in <module>
    import doremi.models as doremi_models
  File "/storage/home/lanzhenzhongLab/zhaoyu/doremi/doremi/models.py", line 8, in <module>
    from flash_attn.models.gpt import GPTLMHeadModel as GPTLMHeadModelFlash
  File "/storage/home/lanzhenzhongLab/zhaoyu/.conda/envs/zy_doremi/lib/python3.8/site-packages/flash_attn-2.0.4-py3.8-linux-x86_64.egg/flash_attn/models/gpt.py", line 31, in <module>
    from flash_attn.models.falcon import remap_state_dict_hf_falcon
ModuleNotFoundError: No module named 'flash_attn.models.falcon'

What is going wrong here?

Also, bash scripts/run_preprocess_pile.sh failed until I updated the datasets package to 2.15.0, but the version pinned in setup.py is 2.10.1.
Did I do something wrong?

@sangmichaelxie
Owner

Just pushed an update that should fix the import issue. Could you add more details on the preprocess issue?
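
After pulling the update and reinstalling flash-attn, one quick way to confirm the fix is to retry the exact import from the traceback (a minimal check, run in the same conda env):

import flash_attn
print(flash_attn.__version__)
# This is the import that previously failed; it should now succeed.
from flash_attn.models.gpt import GPTLMHeadModel
print("flash_attn GPT import OK")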

@Sniper970119
Author

Sniper970119 commented Dec 26, 2023

Thanks for your help! I will try it later.

The preprocessing issue is as follows; it runs successfully after updating datasets to 2.15.0:

Traceback (most recent call last):
  File "scripts/preprocess_pile.py", line 135, in <module>
    main()
  File "scripts/preprocess_pile.py", line 87, in main
    ds = load_dataset('json',
  File "/xxxx/zy_doremi/lib/python3.8/site-packages/datasets/load.py", line 1775, in load_dataset
    return builder_instance.as_streaming_dataset(split=split)
  File "/sxxxx/.conda/envs/zy_doremi/lib/python3.8/site-packages/datasets/builder.py", line 1234, in as_streaming_dataset
    raise NotImplementedError(
NotImplementedError: Loading a streaming dataset cached in a LocalFileSystem is not supported yet.

@Sniper970119
Author

Sniper970119 commented Dec 26, 2023

Hello, I hit another error when running the latest code.

File "doremi/train.py", line 405, in <module>
    main()
  File "doremi/train.py", line 342, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/xxxx/.conda/envs/zy_doremi/lib/python3.8/site-packages/transformers/trainer.py", line 1633, in train
    return inner_training_loop(
  File "/sxxxxu/.conda/envs/zy_doremi/lib/python3.8/site-packages/transformers/trainer.py", line 1872, in _inner_training_loop
    for step, inputs in enumerate(epoch_iterator):
  File "/xxxx/.conda/envs/zy_doremi/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 634, in __next__
    data = self._next_data()
  File "/xxxx.conda/envs/zy_doremi/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1346, in _next_data
    return self._process_data(data)
  File "/xxxxx/.conda/envs/zy_doremi/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1372, in _process_data
    data.reraise()
  File "/sxxxxxx.conda/envs/zy_doremi/lib/python3.8/site-packages/torch/_utils.py", line 644, in reraise
    raise exception
TypeError: Caught TypeError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "xxx/.conda/envs/zy_doremi/lib/python3.8/site-packages/torch/utils/data/_utils/worker.py", line 308, in _worker_loop
    data = fetcher.fetch(index)
  File "/xxxxu/.conda/envs/zy_doremi/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 32, in fetch
    data.append(next(self.dataset_iter))
  File "/xxxxx/.conda/envs/zy_doremi/lib/python3.8/site-packages/transformers/trainer_pt_utils.py", line 807, in __iter__
    for element in self.dataset:
  File "/xxxxx/.conda/envs/zy_doremi/lib/python3.8/site-packages/datasets/iterable_dataset.py", line 1358, in __iter__
    yield from self._iter_pytorch()
  File "/sxxxxu/.conda/envs/zy_doremi/lib/python3.8/site-packages/datasets/iterable_dataset.py", line 1293, in _iter_pytorch
    for key, example in ex_iterable:
  File "/xxxxu/.conda/envs/zy_doremi/lib/python3.8/site-packages/datasets/iterable_dataset.py", line 233, in __iter__
    yield from self.generate_examples_fn(**self.kwargs)
  File "/xxxx/.conda/envs/zy_doremi/lib/python3.8/site-packages/datasets/packaged_modules/generator/generator.py", line 30, in _generate_examples
    for idx, ex in enumerate(self.config.generator(**gen_kwargs)):
  File "/sxxxx/doremi/doremi/dataloader.py", line 339, in take_data_generator
    for ex in ds:
  File "/xxxxx/.conda/envs/zy_doremi/lib/python3.8/site-packages/datasets/iterable_dataset.py", line 1358, in __iter__
    yield from self._iter_pytorch()
  File "/sxxxxxx/.conda/envs/zy_doremi/lib/python3.8/site-packages/datasets/iterable_dataset.py", line 1274, in _iter_pytorch
    ex_iterable = ex_iterable.shard_data_sources(worker_id=worker_info.id, num_workers=worker_info.num_workers)
TypeError: shard_data_sources() got an unexpected keyword argument 'worker_id'

Wandb initialized fine, and then this error occurred. I ran the code on a machine with 8×A100 GPUs. Is something wrong with my config, or with a package version I am using (maybe the datasets version)?

Also, the training log shows an enormous number of epochs, even though num_train_epochs in training_args is still 3.

[ERROR|tokenization_utils_base.py:1042] 2023-12-26 19:04:00,614 >> Using pad_token, but it is not set yet.
[INFO|trainer.py:543] 2023-12-26 19:04:07,514 >> max_steps is given, it will override any value given in num_train_epochs
[INFO|trainer.py:597] 2023-12-26 19:04:07,515 >> Using cuda_amp half precision backend
[INFO|trainer.py:1740] 2023-12-26 19:04:12,755 >> ***** Running training *****
[INFO|trainer.py:1741] 2023-12-26 19:04:12,755 >>   Num examples = 102400000
[INFO|trainer.py:1742] 2023-12-26 19:04:12,755 >>   Num Epochs = 9223372036854775807
[INFO|trainer.py:1743] 2023-12-26 19:04:12,755 >>   Instantaneous batch size per device = 64
[INFO|trainer.py:1744] 2023-12-26 19:04:12,755 >>   Total train batch size (w. parallel, distributed & accumulation) = 512
[INFO|trainer.py:1745] 2023-12-26 19:04:12,755 >>   Gradient Accumulation steps = 1
[INFO|trainer.py:1746] 2023-12-26 19:04:12,755 >>   Total optimization steps = 200000
[INFO|trainer.py:1747] 2023-12-26 19:04:12,755 >>   Number of trainable parameters = 123671040

@sangmichaelxie
Owner

Yes, it's because of the different datasets version. You could try pip install fsspec==2023.9.2 with the original datasets version.
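
To check that the pinned versions are actually active in the environment, a minimal sketch (the version numbers are the ones discussed above):

import datasets, fsspec
print(datasets.__version__)  # expected 2.10.1, the version pinned in setup.py
print(fsspec.__version__)    # expected 2023.9.2 after the suggested pin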

The num_epochs value should be ignored - training terminates based on max_steps.
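
As far as I can tell, the huge Num Epochs in your log is just sys.maxsize, the sentinel value the HF Trainer falls back to when a streaming dataset has no length and training is bounded by max_steps:

import sys
# Matches the "Num Epochs = 9223372036854775807" line in the training log.
print(sys.maxsize)               # 9223372036854775807 on 64-bit builds
print(sys.maxsize == 2**63 - 1)  # True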

@Sniper970119
Author

Thanks for your help!! I can run run_pile_baseline120M.sh now.
But something goes wrong when I run run_pile_doremi120M.sh; it seems to fail in the backward pass.

File "doremi/train.py", line 405, in <module>
    main()
  File "doremi/train.py", line 342, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "xxx/envs/zy_doremi/lib/python3.8/site-packages/transformers/trainer.py", line 1633, in train
    return inner_training_loop(
  File "xxx/.conda/envs/zy_doremi/lib/python3.8/site-packages/transformers/trainer.py", line 1902, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/xxx/doremi/doremi/trainer.py", line 386, in training_step
    loss.backward()
  File "/xxxu/.conda/envs/zy_doremi/lib/python3.8/site-packages/torch/_tensor.py", line 487, in backward
    torch.autograd.backward(
  File "/xxxx.conda/envs/zy_doremi/lib/python3.8/site-packages/torch/autograd/__init__.py", line 200, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [34,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.

Sorry for asking so many run-related questions. I really can't resolve these issues.😭

@Sniper970119
Author

Sniper970119 commented Dec 28, 2023

All package versions match the ones in setup.py.

[screenshot of installed package versions]

@sangmichaelxie
Owner

I'm not totally sure, but sometimes these errors can be mitigated by uninstalling and doing a fresh install of torch / flash-attn. You could also try running on CPU or with CUDA_LAUNCH_BLOCKING=1 to see if it will give a better trace. It could also be worth checking your CUDA version vs. which CUDA version your pytorch is compiled with.
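
For the last check, something like this (a minimal sketch) prints the CUDA version PyTorch was compiled against so you can compare it with what nvidia-smi reports:

import torch
print(torch.__version__)          # installed PyTorch version
print(torch.version.cuda)         # CUDA version PyTorch was built against
print(torch.cuda.is_available())  # whether CUDA is usable at all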

@Sniper970119
Author

Sniper970119 commented Jan 10, 2024

The code can't run in CPU mode, and I have rebuilt the conda environment several times.
The error log contains many lines like:

../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [7,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [26,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [27,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [28,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [32,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [36,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.

Does this look right?
My CUDA version is 11.7 and torch.cuda.is_available() returns True.

The baseline runs successfully, but doremi120M fails.

@sangmichaelxie
Owner

Do you have a more detailed trace? What part of the code raises this error?

@Sniper970119
Author

Sniper970119 commented Jan 17, 2024

Sure, here is the detailed error message, from a run on an 8×A100 server:

  File "doremi/train.py", line 405, in <module>
    main()
  File "doremi/train.py", line 342, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/xxx/.conda/envs/zy_doremi/lib/python3.8/site-packages/transformers/trainer.py", line 1633, in train
    return inner_training_loop(
  File "/xxx/.conda/envs/zy_doremi/lib/python3.8/site-packages/transformers/trainer.py", line 1902, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/xxx/code/opensource/doremi/doremi/trainer.py", line 351, in training_step
    curr_domain_weights = train_domain_weights[inputs['domain_ids']].unsqueeze(-1).expand_as(pertoken_loss).detach()
RuntimeError: CUDA error: device-side assert triggered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [34,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [37,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [52,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [55,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [0,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [1,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [6,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [8,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [15,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [16,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [31,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
[the same traceback and index-out-of-bounds assertions repeat for the other worker processes]
terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: device-side assert triggered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:44 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fb982d8c4d7 in /xxx/.conda/envs/zy_doremi/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7fb982d5636b in /xxx/.conda/envs/zy_doremi/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7fb982e30fa8 in /xxx/.conda/envs/zy_doremi/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0x137bb (0x7fb982e017bb in /xxx/.conda/envs/zy_doremi/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #4: <unknown function> + 0x22d80 (0x7fb982e10d80 in /xxx/.conda/envs/zy_doremi/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #5: <unknown function> + 0x4ccb86 (0x7fb9c22c2b86 in /xxx/.conda/envs/zy_doremi/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #6: <unknown function> + 0x3ee77 (0x7fb982d71e77 in /xxx/.conda/envs/zy_doremi/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #7: c10::TensorImpl::~TensorImpl() + 0x1be (0x7fb982d6a69e in /xxx/.conda/envs/zy_doremi/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #8: c10::TensorImpl::~TensorImpl() + 0x9 (0x7fb982d6a7b9 in /xxx/.conda/envs/zy_doremi/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #9: <unknown function> + 0x752188 (0x7fb9c2548188 in /xxx/.conda/envs/zy_doremi/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #10: THPVariable_subclass_dealloc(_object*) + 0x325 (0x7fb9c2548535 in /xxx/.conda/envs/zy_doremi/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #11: /xxx/.conda/envs/zy_doremi/bin/python() [0x50243d]
frame #12: /xxx/.conda/envs/zy_doremi/bin/python() [0x4e0970]
frame #13: /xxx/.conda/envs/zy_doremi/bin/python() [0x4f1828]
frame #14: /xxx/.conda/envs/zy_doremi/bin/python() [0x4f1811]
frame #15: /xxx/.conda/envs/zy_doremi/bin/python() [0x4f1811]
frame #16: /xxx/.conda/envs/zy_doremi/bin/python() [0x4f1811]
frame #17: /xxx/.conda/envs/zy_doremi/bin/python() [0x4f1811]
frame #18: /xxx/.conda/envs/zy_doremi/bin/python() [0x4c9310]
frame #19: PyDict_SetItemString + 0x52 (0x581a82 in /xxx/.conda/envs/zy_doremi/bin/python)
frame #20: PyImport_Cleanup + 0x93 (0x5a6cb3 in /xxx/.conda/envs/zy_doremi/bin/python)
frame #21: Py_FinalizeEx + 0x71 (0x5a5de1 in /xxx/.conda/envs/zy_doremi/bin/python)
frame #22: Py_RunMain + 0x112 (0x5a1ab2 in /xxx/.conda/envs/zy_doremi/bin/python)
frame #23: Py_BytesMain + 0x39 (0x579e89 in /xxx/.conda/envs/zy_doremi/bin/python)
frame #24: __libc_start_main + 0xf3 (0x7fba0928a6a3 in /lib64/libc.so.6)
frame #25: /xxx/.conda/envs/zy_doremi/bin/python() [0x579d3d]

[identical tracebacks and C++ stack dumps from the remaining ranks are omitted below]
terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: device-side assert triggered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:44 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f545cb834d7 in /xxx/.conda/envs/zy_doremi/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f545cb4d36b in /xxx/.conda/envs/zy_doremi/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f545cc27fa8 in /xxx/.conda/envs/zy_doremi/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x80 (0x7f545db0f410 in /xxx/.conda/envs/zy_doremi/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x58 (0x7f545db129e8 in /xxx/.conda/envs/zy_doremi/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x227 (0x7f545db13f37 in /xxx/.conda/envs/zy_doremi/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #6: <unknown function> + 0xc2b73 (0x7f54db71bb73 in /lib64/libstdc++.so.6)
frame #7: <unknown function> + 0x82de (0x7f54e3bb32de in /lib64/libpthread.so.0)
frame #8: clone + 0x43 (0x7f54e315ae83 in /lib64/libc.so.6)

wandb: Waiting for W&B process to finish... (failed 1). Press Control-C to abort syncing.

wandb: - 0.017 MB of 0.017 MB uploaded (0.000 MB deduped)
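
Since every trace points at train_domain_weights[inputs['domain_ids']], a domain id outside [0, num_domains) would trigger exactly this index-out-of-bounds assert. A hypothetical CPU-side sanity check (names taken from the traceback), placed just before that line in doremi/trainer.py, might localize it:

# Hypothetical debugging aid, not part of the repo: validate indices on the CPU,
# where the failure is readable, before the CUDA gather runs.
ids = inputs['domain_ids'].cpu()
n = len(train_domain_weights)
assert 0 <= ids.min().item() and ids.max().item() < n, \
    f"domain_ids out of range: min={ids.min().item()}, max={ids.max().item()}, num_domains={n}"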

@Richard-Wth

I faced the same issue, but it worked when I lowered the batch_size.
