First, create the `data` folder and move into it:
mkdir -p data
cd data
Then prepare the datasets and evaluation files as follows.
- For the small molecule editing dataset, please check `small_molecule_editing.txt`. Credit to the MoleculeSTM paper.
- For the retrieval database, please use the ZINC250K dataset from here.
- Both the editing and retrieval datasets can be found in this repo.
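As a quick sanity check on the small molecule retrieval file, the SMILES strings can be read with the standard `csv` module. This is a minimal sketch, not the repo's own loader: the column name `smiles` and the helper name `load_smiles` are assumptions, so check the CSV header before relying on them. Values are stripped because some distributions of this file carry trailing whitespace inside the quoted SMILES field.

```python
import csv


def load_smiles(csv_path, column="smiles"):
    """Return the non-empty, whitespace-stripped entries of one CSV column.

    `column` defaults to "smiles", which is an assumption about the header
    of the ZINC250K file; pass another name if your copy differs.
    """
    with open(csv_path, newline="") as handle:
        reader = csv.DictReader(handle)
        if reader.fieldnames is None or column not in reader.fieldnames:
            raise KeyError(f"column {column!r} not found in {csv_path}")
        smiles = []
        for row in reader:
            value = (row.get(column) or "").strip()
            if value:  # skip rows where the column is blank
                smiles.append(value)
        return smiles


# Example call, assuming the layout described in this README:
# smiles = load_smiles("small_molecule/250k_rndm_zinc_drugs_clean_3.csv")
# print(len(smiles), smiles[:3])
```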
- We provide most of the pretrained datasets in `peptide`. You only need to download `Data_S3.csv` from this link.
- If you want to do the data preprocessing yourself, please refer to the following:
cd peptide
python preprocess_step_1_data_extraction.py
python preprocess_step_2_single_prop.py
python preprocess_step_3_multi_prop.py
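The three preprocessing commands above can also be driven from a single Python script that stops at the first failing step. This is a minimal sketch under two assumptions: the scripts take no command-line arguments (as in the commands above), and each step depends on the output of the previous one. The names `STEPS` and `run_steps` are hypothetical and not part of the repo.

```python
import subprocess
import sys

# Script names as listed in this README, in the order they are run.
STEPS = [
    "preprocess_step_1_data_extraction.py",
    "preprocess_step_2_single_prop.py",
    "preprocess_step_3_multi_prop.py",
]


def run_steps(workdir="peptide", steps=STEPS, run=subprocess.run):
    """Run each script inside `workdir` with the current interpreter.

    check=True makes subprocess.run raise CalledProcessError on a
    non-zero exit code, so later steps never run on incomplete output.
    """
    for script in steps:
        run([sys.executable, script], cwd=workdir, check=True)


# Example call from inside the data folder:
# run_steps("peptide")
```

The `run` parameter exists only so the function can be exercised without launching real processes.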
- Download the dataset from this Google Drive.
- Unzip it to the `protein` folder.
- This includes both the editing and retrieval datasets.
- For evaluation, please download `pytorch_model_ss3.bin` from this link. Credit to ProteinDT.
.
├── peptide
│ ├── class1_pseudosequences.csv
│ ├── Data_S3.csv
│ ├── models_class1_presentation
│ │ ├── 10755300.stderr
│ │ .
│ │ .
│ │ .
│ │ └── train_data.csv.bz2
│ ├── peptide_editing.json
│ ├── peptide_editing_threshold.json
│ ├── preprocess_step_1_data_extraction.py
│ ├── preprocess_step_2_single_prop.py
│ ├── preprocess_step_3_multi_prop.py
│ └── selected_alleles.txt
├── protein
│ ├── downstream_datasets
│ │ └── secondary_structure
│ │   ├── secondary_structure_casp12.lmdb
│ │   │ ├── data.mdb
│ │   │ └── lock.mdb
│ │   ├── secondary_structure_cb513.lmdb
│ │   │ ├── data.mdb
│ │   │ └── lock.mdb
│ │   ├── secondary_structure_train.lmdb
│ │   │ ├── data.mdb
│ │   │ └── lock.mdb
│ │   ├── secondary_structure_ts115.lmdb
│ │   │ ├── data.mdb
│ │   │ └── lock.mdb
│ │   └── secondary_structure_valid.lmdb
│ │     ├── data.mdb
│ │     └── lock.mdb
│ ├── pytorch_model_ss3.bin
│ └── pytorch_model_ss8.bin
├── README.md
└── small_molecule
  ├── 250k_rndm_zinc_drugs_clean_3.csv
  └── small_molecule_editing.txt
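
Once everything is downloaded and unzipped, the layout above can be checked with a short script. This is a minimal sketch: `EXPECTED` covers only a subset of the entries in the tree, and both `EXPECTED` and `missing_paths` are hypothetical names rather than something shipped with the repo.

```python
from pathlib import Path

# A subset of the entries shown in the tree above, relative to the data folder.
EXPECTED = [
    "peptide/Data_S3.csv",
    "peptide/peptide_editing.json",
    "peptide/peptide_editing_threshold.json",
    "protein/downstream_datasets/secondary_structure/secondary_structure_train.lmdb",
    "protein/downstream_datasets/secondary_structure/secondary_structure_valid.lmdb",
    "protein/pytorch_model_ss3.bin",
    "small_molecule/250k_rndm_zinc_drugs_clean_3.csv",
    "small_molecule/small_molecule_editing.txt",
]


def missing_paths(root=".", expected=EXPECTED):
    """Return the expected relative paths that do not exist under `root`."""
    root = Path(root)
    return [rel for rel in expected if not (root / rel).exists()]


# Example call from inside the data folder:
# for rel in missing_paths("."):
#     print("missing:", rel)
```

An empty result means every listed file or folder is present; it does not verify the contents of the files.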