MAP4 TMAP of the COCONUT microbial and plant natural products, colored by plant (green), fungal (blue), or bacterial (orange) origin.
After cloning the repo:
- cat data/COCONUT_DB.sdf.gz.part-* | gunzip -d -c > data/COCONUT_DB.sdf
- cat data/MAP4-SVM-coconut.all.pkl.gz-part-* | gunzip -d -c > data/MAP4-SVM-coconut.all.pkl
The February 2021 version of the COCONUT database was downloaded. The 60,171 COCONUT entries with a publication source and annotated as fungal, bacterial, or plant NPs were extracted. The numbers of carbon, oxygen, and nitrogen atoms, the total number of atoms, and the number of bonds were extracted from the database. MW, fraction of sp3 C, hydrogen bond donor (HBD) and acceptor (HBA) counts, calculated logP with the Crippen method (AlogP), and topological polar surface area (TPSA) were calculated using RDKit. Glycosylated and/or peptidic structures were identified using Daylight SMARTS patterns. Molecules that violated more than one Lipinski rule were labeled as non-Lipinski. The MAP4 fingerprint was calculated in 1024 dimensions.
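The non-Lipinski labeling rule can be illustrated with a minimal sketch. The function names are hypothetical, and the thresholds are the standard rule-of-five cutoffs (MW ≤ 500, logP ≤ 5, HBD ≤ 5, HBA ≤ 10), which are assumed here rather than taken from the repository code.

```python
def count_lipinski_violations(mw, alogp, hbd, hba):
    """Count how many of the four rule-of-five criteria a molecule violates.

    Thresholds are the standard Lipinski cutoffs (an assumption of this sketch).
    """
    violations = 0
    if mw > 500:      # molecular weight above 500 Da
        violations += 1
    if alogp > 5:     # calculated logP above 5
        violations += 1
    if hbd > 5:       # more than 5 hydrogen bond donors
        violations += 1
    if hba > 10:      # more than 10 hydrogen bond acceptors
        violations += 1
    return violations


def is_non_lipinski(mw, alogp, hbd, hba):
    """Label a molecule as non-Lipinski if it violates more than one rule."""
    return count_lipinski_violations(mw, alogp, hbd, hba) > 1
```

With this rule, a molecule that only exceeds the molecular weight cutoff still counts as Lipinski-compliant, because a single violation is tolerated.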
The entries of this COCONUT subset were randomly split 50/50 into a training set and a test set. The SVM was trained on the MAP4 fingerprints of the training set using a custom kernel.
Please note that when using MAP4 for machine learning, a custom kernel (or a custom loss function) is needed because the similarity/dissimilarity between two MinHashed fingerprints cannot be assessed with "standard" Jaccard, Manhattan, or Cosine functions. Due to MinHashing, the order of the features matters, and the distance cannot be calculated "feature-wise". There is a well-written blog post that explains it.
- The custom kernel implemented for the SVM models calculates the similarity matrix between two lists of MinHashed fingerprints. The similarity of fingerprint a and fingerprint b is calculated by (1) counting the elements that have the same value at the same index in a and b, and (2) dividing that count by the number of elements in fingerprint a.
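The two steps described above can be sketched with NumPy as follows. This is an illustrative version, not necessarily the code shipped in the repository. The function name `minhash_kernel` is hypothetical, and all fingerprints are assumed to have the same length.

```python
import numpy as np


def minhash_kernel(A, B):
    """Similarity matrix between two lists of MinHashed fingerprints.

    Entry [i, j] is the fraction of positions at which A[i] and B[j]
    hold the same value.
    """
    A = np.asarray(A)
    B = np.asarray(B)
    # Broadcast to shape (len(A), len(B), n_elements) and compare position by position.
    matches = A[:, None, :] == B[None, :, :]
    # (1) count matching positions, (2) divide by the fingerprint length.
    return matches.sum(axis=2) / A.shape[1]


fps_a = [[1, 2, 3, 4], [5, 6, 7, 8]]
fps_b = [[1, 2, 0, 4], [8, 7, 6, 5]]
print(minhash_kernel(fps_a, fps_b))
# [[0.75 0.  ]
#  [0.   0.  ]]
```

The second fingerprint in each list contains the same values in a different order, yet the similarity is 0, which shows why a set-based Jaccard function does not apply here. The broadcasted comparison allocates an intermediate array of size `len(A) * len(B) * n_elements`, so for large datasets it would need to be computed in chunks.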
The class weights were inversely proportional to the class frequency, and the hyperparameter C was optimized using 5-fold cross-validation. During the hyperparameter optimization, 20% of the training set was left out as a validation set, and the balanced accuracy on the validation set was maximized. The hyperparameter C was optimized among the values 0.1, 1, 10, 100, and 1000, resulting in C = 1.
The classifier was implemented using scikit-learn with the “one versus rest” strategy.
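A minimal sketch of how such a search over C could look in scikit-learn is shown below. It is a simplification under stated assumptions: it uses a single 20% hold-out split rather than repeated folds, wraps `SVC` in `OneVsRestClassifier` as one possible reading of the "one versus rest" strategy, and uses `class_weight="balanced"` for the inverse-frequency weights. The helper names `position_match_kernel` and `select_c` are hypothetical.

```python
import numpy as np
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC


def position_match_kernel(A, B):
    """Fraction of positions with identical values, for every pair (A[i], B[j])."""
    A = np.asarray(A)
    B = np.asarray(B)
    return (A[:, None, :] == B[None, :, :]).sum(axis=2) / A.shape[1]


def select_c(X, y, c_values=(0.1, 1, 10, 100, 1000), seed=0):
    """Pick the C value with the highest balanced accuracy on a 20% hold-out set."""
    X_fit, X_val, y_fit, y_val = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    scores = {}
    for c in c_values:
        # class_weight="balanced" weights classes inversely to their frequency.
        svm = SVC(C=c, kernel=position_match_kernel, class_weight="balanced")
        model = OneVsRestClassifier(svm)
        model.fit(X_fit, y_fit)
        scores[c] = balanced_accuracy_score(y_val, model.predict(X_val))
    best_c = max(scores, key=scores.get)
    return best_c, scores
```

Here `X` is expected to be a 2D integer array with one MinHashed fingerprint per row and `y` the corresponding origin labels. Passing a callable as `kernel` lets scikit-learn build the Gram matrix from the fingerprints directly.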
After the evaluation process, a second version of the MAP4 SVM classifier was trained using both the training and test sets to learn from all 60 thousand curated data points. This version of the MAP4 SVM classifier can be used here.
Using the indices generated by the MinHashing procedure of the MAP4 calculation, an LSH forest was generated and used to lay out the TMAP. The resulting TMAP can be found here.
The MAP4, ECFP4, and RDKit AP fingerprints and a set of 11 properties (MW, fraction of sp3 C, HBD and HBA count, AlogP, number of carbons, oxygens, and nitrogens, total number of atoms, number of bonds, and TPSA) were used to train four different SVM classifiers in a 5-fold cross-validation. For all classifiers, the class weights were inversely proportional to the class frequency, and the hyperparameters were optimized using 10% of the available data to maximize the balanced accuracy (Table 4). For the properties SVM, the 11 values were scaled to zero mean and unit variance. All classifiers were implemented using scikit-learn with the “one versus rest” strategy.
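For the properties SVM, the scaling and cross-validation steps could be wired together roughly as follows. This is a sketch under assumptions: the function name `evaluate_property_svm` is hypothetical, the RBF kernel is simply scikit-learn's default (the kernel used for the properties SVM is not stated above), and `C` is left as a parameter.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def evaluate_property_svm(properties, labels, c=1.0, folds=5):
    """Return the balanced accuracy of a scaled-property SVM for each CV fold."""
    model = make_pipeline(
        # Scale each of the property columns to zero mean and unit variance.
        StandardScaler(),
        # Default RBF kernel (an assumption); balanced class weights.
        SVC(C=c, class_weight="balanced"),
    )
    return cross_val_score(
        model, properties, labels, cv=folds, scoring="balanced_accuracy"
    )
```

Placing the scaler inside the pipeline means it is refit on the training portion of each fold, so the held-out fold does not leak into the scaling statistics.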
conda env create -f environment.yml
conda activate aipep