Describe the bug

I compiled MACE on a local GPU server cluster in a virtual environment with Python 3.9, CUDA 11.8, and PyTorch 2.3.0.
My input script looks like this:
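(Sketch only: the Slurm resources, file names, and hyperparameters below are placeholders for a typical mace_run_train submission, not the actual values used; the venv path is the one that appears in the traceback further down.)

#!/bin/bash
#SBATCH --job-name=mace_train          # placeholder Slurm resources
#SBATCH --gres=gpu:1
#SBATCH --time=24:00:00

source /home/venv_mace_gpu_cuda118_pytorch230/bin/activate

# Note the .extxyz suffix on the training file -- this is what later trips
# the suffix check in run_train.py.
mace_run_train \
    --name="mace_model" \
    --train_file="training_set.extxyz" \
    --valid_fraction=0.05 \
    --model="MACE" \
    --r_max=5.0 \
    --batch_size=10 \
    --max_num_epochs=100 \
    --device=cuda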
Then, the training job fails almost immediately, and the Slurm system generates the following error file:
Traceback (most recent call last):
  File "/home/venv_mace_gpu_cuda118_pytorch230/bin/mace_run_train", line 8, in <module>
    sys.exit(main())
  File "/home/venv_mace_gpu_cuda118_pytorch230/lib/python3.9/site-packages/mace/cli/run_train.py", line 51, in main
    run(args)
  File "/home/venv_mace_gpu_cuda118_pytorch230/lib/python3.9/site-packages/mace/cli/run_train.py", line 168, in run
    assert args.train_file.endswith(".xyz"), "Must specify atomic_numbers when using .h5 train_file input"
AssertionError: Must specify atomic_numbers when using .h5 train_file input

Also, the log file prints the following:

Is this happening because the size of the training set is huge?
No, it's because the suffix .xyz is hardwired in your version of the code and you're using .extxyz. This has been fixed in #462. @ilyes319, what's the status of that PR?
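(For anyone checking their own install, a quick grep shows where the suffix check lives; the path below is taken from the traceback above, so adjust it for your own venv.)

# Confirm the hardcoded suffix check in the installed copy of run_train.py:
grep -n 'endswith(".xyz")' \
    /home/venv_mace_gpu_cuda118_pytorch230/lib/python3.9/site-packages/mace/cli/run_train.py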
Ha, my habit of using different suffixes to distinguish the two xyz formats caused this error.
I renamed the geometry file to *.xyz and resubmitted. The training job now starts and the crash no longer occurs.
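For anyone else hitting this before #462 is merged, the workaround is just a rename; the file name below is a placeholder, and the extended-XYZ contents themselves don't need any conversion.

# Rename so the hardcoded ".xyz" suffix check passes; update --train_file to match.
mv training_set.extxyz training_set.xyz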