The COCONUT database was downloaded, and its 400,837 entries were filtered down to the 67,730 structures having a taxonomical annotation and a DOI annotation at least 10 characters long.
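The filtering step can be sketched as follows. This is a minimal illustration rather than the code used in the study: the record layout and the field names `taxonomy` and `doi` are hypothetical, and the entries are assumed to be already parsed into dictionaries.

```python
def has_required_annotations(entry, min_doi_length=10):
    """Return True if the entry has a taxonomy annotation and a DOI of at least min_doi_length characters."""
    taxonomy = entry.get("taxonomy") or []      # hypothetical field name
    doi = (entry.get("doi") or "").strip()      # hypothetical field name
    return bool(taxonomy) and len(doi) >= min_doi_length


# Toy records standing in for parsed database entries.
entries = [
    {"id": "A", "taxonomy": ["Streptomyces"], "doi": "10.1000/example.1"},
    {"id": "B", "taxonomy": [], "doi": "10.1000/example.2"},
    {"id": "C", "taxonomy": ["Homo sapiens"], "doi": "n.a."},
]

kept = [e["id"] for e in entries if has_required_annotations(e)]
print(kept)
```

Only entry `A` passes both conditions here: `B` has no taxonomy annotation and `C` has a DOI string shorter than ten characters.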
Molecular weight (MW), fraction of sp3 carbons (Fsp3), hydrogen bond donor (HBD) and acceptor (HBA) counts, and the LogP according to the Crippen method [6] (AlogP) were calculated using RDKit.
Molecules breaking more than one Lipinski rule were labeled as non-Lipinski.
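The labeling rule can be illustrated with a short sketch. The text does not state the cutoffs, so the standard rule-of-five thresholds (MW ≤ 500 Da, LogP ≤ 5, HBD ≤ 5, HBA ≤ 10) are assumed here; the function names are hypothetical, and the descriptor values are assumed to have been computed beforehand (for example with RDKit).

```python
def count_lipinski_violations(mw, alogp, hbd, hba):
    """Count how many of the four rule-of-five criteria are violated.

    Thresholds are the commonly used ones and are an assumption of this sketch.
    """
    checks = [
        mw > 500,     # molecular weight above 500 Da
        alogp > 5,    # calculated LogP above 5
        hbd > 5,      # more than 5 hydrogen bond donors
        hba > 10,     # more than 10 hydrogen bond acceptors
    ]
    return sum(checks)


def is_non_lipinski(mw, alogp, hbd, hba):
    """A molecule is labeled non-Lipinski when it breaks more than one rule."""
    return count_lipinski_violations(mw, alogp, hbd, hba) > 1


# Illustrative descriptor values, not taken from the dataset.
print(is_non_lipinski(mw=520.0, alogp=3.0, hbd=2, hba=6))    # one violation
print(is_non_lipinski(mw=734.0, alogp=1.8, hbd=5, hba=14))   # two violations
```

A single violation is tolerated, so the first example remains Lipinski-compliant while the second is labeled non-Lipinski.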
The presence/absence of a peptide or a glycoside moiety was evaluated using Daylight SMILES arbitrary target specification (SMARTS) language and RDKit.
The glycoside substructure was defined as a cyclic N- or O-acetal substructure. The plant, fungal, bacterial, animal, marine, or human origin of the natural products (NPs) was derived from the COCONUT taxonomy annotation.
If an entry belonged to more than one class, it was assigned to a single class according to the priority: human > animal > bacteria > fungi > plant > marine.
The process resulted in 33,821 plant NPs, 15,693 fungal NPs, 10,819 bacterial NPs, 1,779 human NPs, 1,219 animal NPs, 1,035 marine NPs, and 3,364 NPs lacking a super classification, which were labeled as “other”.
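The priority-based assignment can be sketched as below. The function name and the representation of an entry's classes as a set of strings are assumptions made for illustration; only the priority order and the “other” fallback come from the text.

```python
# Priority order as stated in the text, highest priority first.
PRIORITY = ["human", "animal", "bacteria", "fungi", "plant", "marine"]


def assign_origin(classes):
    """Return the highest-priority class present, or 'other' if none is found."""
    for origin in PRIORITY:
        if origin in classes:
            return origin
    return "other"


print(assign_origin({"plant", "bacteria"}))  # bacteria outranks plant
print(assign_origin({"marine", "plant"}))    # plant outranks marine
print(assign_origin(set()))                  # no super classification
```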
Within this analysis, lipidated natural products were selected following four criteria:
- the presence of a terminal eight-carbon-long aliphatic chain,
- the presence of at least one non-carbohydrate ring,
- the presence of a non-lipidic and non-carbohydrate core of at least 200 Da,
- and the absence of a sterol substructure.
The presence of a terminal eight-carbon-long aliphatic chain was determined with RDKit using the Daylight SMARTS language.
To assess whether the chain was terminal, the lipidic chain substructure was identified through SMARTS and removed, and the length of the SMILES of each remaining fragment was calculated.
When only one of the remaining fragments had a SMILES length of more than ten characters, the chain was considered terminal.
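The terminality criterion can be sketched on plain strings. This sketch assumes that the chain atoms have already been removed and that the remainder is written as a dot-separated multi-fragment SMILES; the function name and the example strings are hypothetical.

```python
def chain_is_terminal(remainder_smiles, min_length=10):
    """Return True when exactly one remaining fragment has a SMILES longer than min_length characters."""
    fragments = [frag for frag in remainder_smiles.split(".") if frag]
    large_fragments = [frag for frag in fragments if len(frag) > min_length]
    return len(large_fragments) == 1


# One large fragment plus a small leftover: the chain sat at the periphery.
print(chain_is_terminal("NC(=O)c1ccc2ccccc2c1.O"))

# Two large fragments: the chain bridged two substantial parts of the molecule.
print(chain_is_terminal("OC(=O)c1ccccc1O.NC(=O)c1ccc2ccccc2c1"))
```

Using the SMILES string length as a size proxy is a cheap heuristic: small leftovers such as a terminal methyl or hydroxyl group stay below the ten-character threshold and do not count as a second core.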
Non-carbohydrate rings were counted using RDKit and the “sugar free SMILES” annotated in COCONUT.
To assess the MW of the non-lipidic and non-carbohydrate core, the lipid chain substructure was identified through SMARTS and removed from the COCONUT “sugar free” SMILES, and the MW of the remaining largest fragment was calculated.
The absence of a sterol substructure was evaluated using the Daylight SMARTS language with RDKit.
The selection led to 1,390 lipidated natural products, which were further filtered down manually to 1,308 structures.
The 1,308 structures were characterized based on their origin, lipidic linker, and the length and unsaturation level of their longest lipidic chain.
The lipidic linker was classified as amide, ester, ether, or amine using the Daylight SMARTS language and RDKit.
The origin of the natural products was obtained as described above.
The unsaturation number was calculated by counting the number of double and triple bonds in the longest lipidic chain.
The fraction of unsaturation was calculated by doubling the number of unsaturations and dividing it by the number of atoms present in the longest lipidic chain.
The lipidic chains were identified as described above, and the numbers of unsaturations and atoms were calculated using RDKit.
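As a worked example of the fraction of unsaturation, the formula can be written as a small function. The function name is hypothetical, and the counts are assumed to have been obtained beforehand from the longest lipidic chain.

```python
def unsaturation_fraction(n_unsaturations, n_chain_atoms):
    """Return 2 * (number of double and triple bonds) / (number of atoms in the chain)."""
    if n_chain_atoms <= 0:
        raise ValueError("the chain must contain at least one atom")
    return 2 * n_unsaturations / n_chain_atoms


# Illustrative values: an 18-atom chain with two double bonds gives 4/18.
fraction = unsaturation_fraction(n_unsaturations=2, n_chain_atoms=18)
print(round(fraction, 3))
```

Doubling the count can be read as accounting for the two atoms involved in each multiple bond, so that, for isolated double or triple bonds, the value approximates the share of chain atoms that are unsaturated.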
TMAP was used to visualize the 67,730-entry COCONUT subset.
MAP4 was calculated using the related open-source code.
The indices generated by the MinHash procedure of the MAP4 calculation were used to create a locality-sensitive hashing (LSH) forest of 32 trees.
Then, for each structure to display, the 20 approximate nearest neighbors (NNs) were obtained from the LSH forest, and the TMAP minimum spanning tree layout was calculated.
The LSH forest and the minimum spanning tree layout were calculated using the TMAP open-source code.
Finally, Faerun [4] was used to display the obtained tree layout interactively.
The MW, Fsp3, HBD and HBA count, AlogP, taxonomical origin, presence/absence of a peptidic substructure, presence/absence of a glycoside substructure, the Lipinski classification, and the lipidation of the displayed structures were used to color code the interactive TMAP.
The resulting interactive map is available online.