
Error accessing Git repository #483

Closed
turbosonics opened this issue Jun 21, 2024 · 2 comments

Comments

turbosonics commented Jun 21, 2024

Describe the bug
I installed MACE on a local GPU server cluster in a virtual environment with Python 3.9, CUDA 11.8, and PyTorch 2.3.0.

My input script looks like this:

mace_run_train \
    --name="test5" \
    --seed=123 \
    --device=cuda \
    --default_dtype="float32" \
    --error_table="PerAtomRMSEstressvirials" \
    --model="MACE" \
    --r_max=6.0 \
    --hidden_irreps='128x0e + 128x1o + 128x2e' \
    --num_channels=32 \
    --max_L=2 \
    --compute_stress=True \
    --compute_forces=True \
    --train_file="${GEOMETRY_DIR}/${GEOMETRY_TRAIN_NAME}" \
    --valid_fraction=0.2 \
    --test_file="${GEOMETRY_DIR}/${GEOMETRY_TEST_NAME}" \
    --E0s='{3:-0.1133, 8:-0.3142, 40:-1.6277, 57:-0.4746}' \
    --energy_key="energy" \
    --forces_key="forces" \
    --stress_key="stress" \
    --virials_key="virial" \
    --loss="weighted" \
    --forces_weight=1.0 \
    --energy_weight=1.0 \
    --stress_weight=1.0 \
    --virials_weight=1.0 \
    --config_type_weights='{"Default":1.0}' \
    --optimizer="adam" \
    --batch_size=5 \
    --valid_batch_size=5 \
    --lr=0.005 \
    --amsgrad \
    --ema \
    --ema_decay=0.99 \
    --max_num_epochs=2000 \
    --patience=50 \
    --keep_checkpoints \
    --restart_latest \
    --save_cpu \
    --clip_grad=100.0

The training job then fails almost immediately, and the Slurm system generates the following error file:

Traceback (most recent call last):
  File "/home/venv_mace_gpu_cuda118_pytorch230/bin/mace_run_train", line 8, in <module>
    sys.exit(main())
  File "/home/venv_mace_gpu_cuda118_pytorch230/lib/python3.9/site-packages/mace/cli/run_train.py", line 51, in main
    run(args)
  File "/home/venv_mace_gpu_cuda118_pytorch230/lib/python3.9/site-packages/mace/cli/run_train.py", line 168, in run
    assert args.train_file.endswith(".xyz"), "Must specify atomic_numbers when using .h5 train_file input"
AssertionError: Must specify atomic_numbers when using .h5 train_file input

The log file also prints the following:

2024-06-21 14:47:55.515 INFO: MACE version: 0.3.5
2024-06-21 14:47:55.516 INFO: Configuration: Namespace(config=None, name='test5', seed=123, log_dir='logs', model_dir='.', checkpoints_dir='checkpoints', results_dir='results', downloads_dir='downloads', device='cuda', default_dtype='float32', distributed=False, log_level='INFO', error_table='PerAtomRMSEstressvirials', model='MACE', r_max=6.0, radial_type='bessel', num_radial_basis=8, num_cutoff_basis=5, pair_repulsion=False, distance_transform='None', interaction='RealAgnosticResidualInteractionBlock', interaction_first='RealAgnosticResidualInteractionBlock', max_ell=3, correlation=3, num_interactions=2, MLP_irreps='16x0e', radial_MLP='[64, 64, 64]', hidden_irreps='128x0e + 128x1o + 128x2e', num_channels=32, max_L=2, gate='silu', scaling='rms_forces_scaling', avg_num_neighbors=1, compute_avg_num_neighbors=True, compute_stress=True, compute_forces=True, train_file='testinput.extxyz', valid_file=None, valid_fraction=0.2, test_file='testinput.extxyz', test_dir=None, multi_processed_test=False, num_workers=0, pin_memory=True, atomic_numbers=None, mean=None, std=None, statistics_file=None, E0s='{3:-0.1133, 8:-0.3142, 40:-1.6277, 57:-0.4746}', keep_isolated_atoms=False, energy_key='energy', forces_key='forces', virials_key='virial', stress_key='stress', dipole_key='dipole', charges_key='charges', loss='weighted', forces_weight=1.0, swa_forces_weight=100.0, energy_weight=1.0, swa_energy_weight=1000.0, virials_weight=1.0, swa_virials_weight=10.0, stress_weight=1.0, swa_stress_weight=10.0, dipole_weight=1.0, swa_dipole_weight=1.0, config_type_weights='{"Default":1.0}', huber_delta=0.01, optimizer='adam', beta=0.9, batch_size=5, valid_batch_size=5, lr=0.005, swa_lr=0.001, weight_decay=5e-07, amsgrad=True, scheduler='ReduceLROnPlateau', lr_factor=0.8, scheduler_patience=50, lr_scheduler_gamma=0.9993, swa=False, start_swa=None, ema=True, ema_decay=0.99, max_num_epochs=2000, patience=50, foundation_model=None, foundation_model_readout=True, eval_interval=2, keep_checkpoints=True, save_all_checkpoints=False, restart_latest=True, save_cpu=True, clip_grad=100.0, wandb=False, wandb_dir=None, wandb_project='', wandb_entity='', wandb_name='', wandb_log_hypers=['num_channels', 'max_L', 'correlation', 'lr', 'swa_lr', 'weight_decay', 'batch_size', 'max_num_epochs', 'start_swa', 'energy_weight', 'forces_weight'])
2024-06-21 14:47:55.546 INFO: CUDA version: 11.8, CUDA device: 0
2024-06-21 14:47:55.897 INFO: Error accessing Git repository: /mnt/work/MACE_training/20240621_test/01a5_test5

Is this happening because the training set is huge?

bernstei (Collaborator) commented

No, it's because the suffix .xyz is hardwired in your version of the code and you're using .extxyz. This has been fixed in #462. @ilyes319, what's the status of that PR?
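
For illustration, the hard-wired check from the traceback only accepts file names ending in the literal .xyz, so the .extxyz name in the log fails it. A minimal Python sketch, not the actual MACE code path:

# Sketch of the suffix check behind the AssertionError above:
"testinput.xyz".endswith(".xyz")     # True  -> the assertion passes
"testinput.extxyz".endswith(".xyz")  # False -> raises the AssertionError shown in the traceback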

turbosonics (Author) commented

> No, it's because the suffix .xyz is hardwired in your version of the code and you're using .extxyz. This has been fixed in #462. @ilyes319, what's the status of that PR?

Ha, my habit of distinguishing the two xyz formats by extension caused this error.

I changed the geometry file name to *.xyz and resubmitted. The training job now starts and the crash no longer occurs.

Thank you.
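
For anyone hitting the same assertion before the fix in #462 lands, a symlink gives the existing extended-XYZ file a name that passes the hard-wired .xyz check without duplicating data. A minimal shell sketch, using the testinput.extxyz name from the log above (adjust paths to your setup):

# Expose the extended-XYZ file under a .xyz name so the suffix check passes
ln -s testinput.extxyz testinput.xyz
# then point --train_file / --test_file at testinput.xyz and resubmit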

ilyes319 closed this as completed Jul 3, 2024