Describe the bug

I compiled MACE on a local GPU server cluster in a virtual environment with Python 3.9, CUDA 11.8, and PyTorch 2.3.0.
My input script looks like this:
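(Sketch only: the Slurm resources, file names, and hyperparameters below are placeholders for a typical mace_run_train submission, not the actual values used; the venv path is the one that appears in the traceback further down.)

#!/bin/bash
#SBATCH --job-name=mace_train          # placeholder Slurm resources
#SBATCH --gres=gpu:1
#SBATCH --time=24:00:00

source /home/venv_mace_gpu_cuda118_pytorch230/bin/activate

# Note the .extxyz suffix on the training file -- this is what later trips
# the suffix check in run_train.py.
mace_run_train \
    --name="mace_model" \
    --train_file="training_set.extxyz" \
    --valid_fraction=0.05 \
    --model="MACE" \
    --r_max=5.0 \
    --batch_size=10 \
    --max_num_epochs=100 \
    --device=cuda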
Then, the training job fails almost immediately, and the Slurm system generates the following error file:
Traceback (most recent call last):
  File "/home/venv_mace_gpu_cuda118_pytorch230/bin/mace_run_train", line 8, in <module>
    sys.exit(main())
  File "/home/venv_mace_gpu_cuda118_pytorch230/lib/python3.9/site-packages/mace/cli/run_train.py", line 51, in main
    run(args)
  File "/home/venv_mace_gpu_cuda118_pytorch230/lib/python3.9/site-packages/mace/cli/run_train.py", line 168, in run
    assert args.train_file.endswith(".xyz"), "Must specify atomic_numbers when using .h5 train_file input"
AssertionError: Must specify atomic_numbers when using .h5 train_file input

Also, the log file prints the following:

Is this happening because the size of the training set is huge?
No, it's because the suffix .xyz is hardwired in your version of the code and you're using .extxyz. This has been fixed in #462. @ilyes319, what's the status of that PR?
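(For anyone checking their own install, a quick grep shows where the suffix check lives; the path below is taken from the traceback above, so adjust it for your own venv.)

# Confirm the hardcoded suffix check in the installed copy of run_train.py:
grep -n 'endswith(".xyz")' \
    /home/venv_mace_gpu_cuda118_pytorch230/lib/python3.9/site-packages/mace/cli/run_train.py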
Ha, my habit of using different suffixes to distinguish the two xyz formats caused this error.
I renamed the geometry file to *.xyz and resubmitted. The training job now starts and the crash no longer occurs.
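For anyone else hitting this before #462 is merged, the workaround is just a rename; the file name below is a placeholder, and the extended-XYZ contents themselves don't need any conversion.

# Rename so the hardcoded ".xyz" suffix check passes; update --train_file to match.
mv training_set.extxyz training_set.xyz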