Nicolas Morales edited this page Dec 14, 2020 · 3 revisions

Publication: Morales N, Bauchet GJ, Tantikanjana T, Powell AF, Ellerbrock BJ, Tecle IY, et al. (2020) High density genotype storage for plant breeding in the Chado schema of Breedbase. PLoS ONE 15(11): e0240059. https://doi.org/10.1371/journal.pone.0240059

To store high-density genotyping data, Breedbase relies on the JSON features of PostgreSQL. Genotype data is uploaded into Breedbase in the VCF format or a custom Intertek format, and is stored under a genotyping data project and a genotyping protocol, as described above. The preferred method for handling genotyping data in Breedbase is to: 1) store the genotyping plate in Breedbase; 2) send the genotyping plate layout, along with the physical genotyping plate, to the genotyping vendor; and 3) upload the returned genotyping data into Breedbase. Because genotyping plates were historically not always loaded into Breedbase prior to genotyping, Breedbase still allows genotyping data to be uploaded and associated directly with germplasm names instead of genotyping plate wells. In other words, the genotyping data is linked directly to the stock table entry of either the genotyping plate well sample or the germplasm name. Which one applies depends on the sample names in the returned genotyping data file: they are either identifiers for individual wells in the genotyping plate, as discussed above in the genotyping plate section, or simple germplasm names.

Each stock table entry that was genotyped, whether it is a genotyping plate well sample or a germplasm name, is linked to an entry in the genotype table via the nd_experiment_stock and nd_experiment_genotype tables. Both links go through an entry in the nd_experiment table that uses the type name ‘genotyping_experiment’ from the ‘experiment_type’ controlled vocabulary. Each stock table entry is linked to its own nd_experiment entry, in a manner similar to how the phenotyping data is saved, as described above. The nd_experiment entry is also linked to the nd_protocol and project tables via the nd_experiment_protocol and nd_experiment_project tables, respectively, which ties it to the relevant genotyping protocol and genotyping data project. The entry in the genotype table is linked to an entry in the genotypeprop table via an EAV model; the value of the genotypeprop entry is a JSON formatted string stored using the type name ‘vcf_snp_genotyping’ from the ‘genotype_property’ controlled vocabulary.
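
The chain of links described above can be illustrated with a small, self-contained sketch. It uses an in-memory SQLite database as a stand-in for PostgreSQL, and a heavily reduced toy schema: the table names come from the text, but the column names (stock_id, uniquename, type_id, value, and so on) are assumptions modelled on common Chado conventions, and the stock name, marker name, and genotype call are made-up illustrative values. It is not the query Breedbase itself runs.

```python
import json
import sqlite3

# Toy stand-in for the Chado tables named in the text (columns are assumed).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE cvterm (cvterm_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE stock (stock_id INTEGER PRIMARY KEY, uniquename TEXT);
CREATE TABLE nd_experiment (nd_experiment_id INTEGER PRIMARY KEY, type_id INTEGER);
CREATE TABLE nd_experiment_stock (nd_experiment_id INTEGER, stock_id INTEGER);
CREATE TABLE genotype (genotype_id INTEGER PRIMARY KEY);
CREATE TABLE nd_experiment_genotype (nd_experiment_id INTEGER, genotype_id INTEGER);
CREATE TABLE genotypeprop (genotype_id INTEGER, type_id INTEGER, value TEXT);
""")

# One genotyped stock entry with its own nd_experiment and genotype entries.
conn.execute("INSERT INTO cvterm VALUES (1, 'genotyping_experiment')")
conn.execute("INSERT INTO cvterm VALUES (2, 'vcf_snp_genotyping')")
conn.execute("INSERT INTO stock VALUES (10, 'example_accession_1')")
conn.execute("INSERT INTO nd_experiment VALUES (100, 1)")
conn.execute("INSERT INTO nd_experiment_stock VALUES (100, 10)")
conn.execute("INSERT INTO genotype VALUES (1000)")
conn.execute("INSERT INTO nd_experiment_genotype VALUES (100, 1000)")
conn.execute(
    "INSERT INTO genotypeprop VALUES (1000, 2, ?)",
    (json.dumps({"S1_1000": {"GT": "0/1"}}),),
)

# stock -> nd_experiment_stock -> nd_experiment -> nd_experiment_genotype
#       -> genotypeprop, filtered by the two controlled vocabulary type names.
QUERY = """
SELECT gp.value
FROM stock s
JOIN nd_experiment_stock es ON es.stock_id = s.stock_id
JOIN nd_experiment e ON e.nd_experiment_id = es.nd_experiment_id
JOIN cvterm et ON et.cvterm_id = e.type_id
JOIN nd_experiment_genotype eg ON eg.nd_experiment_id = e.nd_experiment_id
JOIN genotypeprop gp ON gp.genotype_id = eg.genotype_id
JOIN cvterm pt ON pt.cvterm_id = gp.type_id
WHERE s.uniquename = ?
  AND et.name = 'genotyping_experiment'
  AND pt.name = 'vcf_snp_genotyping'
"""


def fetch_genotype(connection, stock_name):
    """Return the decoded genotypeprop JSON for a stock name, or None."""
    row = connection.execute(QUERY, (stock_name,)).fetchone()
    return None if row is None else json.loads(row[0])


genotype_scores = fetch_genotype(conn, "example_accession_1")
```

The same join works whether the stock entry is a genotyping plate well sample or a germplasm name, since both are rows in the stock table.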

The two key JSON formatted objects for storing high-density genotyping data in Breedbase are an entry for the entire genotyping protocol in the nd_protocolprop table and an entry in the genotypeprop table for each of the genotyped samples. The entry in the nd_protocolprop table is a complex object with the following top-level keys: ‘reference_genome_name’, ‘species_name’, ‘header_information_lines’, ‘sample_observation_unit_type_name’, ‘marker_names’, ‘markers’, and ‘markers_array’. The ‘reference_genome_name’ key stores a string value naming the user-defined reference genome, e.g. ‘Manihotv6.1’. The ‘species_name’ key stores a string value of the organism species name that the samples belong to; this species name must already be present in the organism table of the database, and it is stored here only for convenience. The ‘header_information_lines’ key stores an array value of the commented header information lines from the uploaded VCF or Intertek formatted file. The ‘sample_observation_unit_type_name’ key stores a string value that is either ‘accession’ or ‘tissue_sample’ and is used only to distinguish whether the genotyping protocol was used to sample germplasm names or genotyping plate tissue sample wells, as discussed above. The ‘marker_names’ key stores an array value of all the marker names involved in the genotyping protocol. The ‘markers’ key stores a value that is an object of objects; the top-level key is the marker name and the corresponding object contains key-value pairs for the chromosome, base pair position, comma-separated alternate alleles, reference allele, quality, filter information, marker summary information, and marker score format information. These marker information fields are taken directly from the VCF format model. The ‘markers_array’ key stores the same information as the ‘markers’ object, but in a format suited to certain JSON queries in PostgreSQL; the value is an array of objects where each object contains information about a single marker, e.g. the chromosome, position, and other fields mentioned previously.
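
As a rough illustration of the protocol-level object described above, the following sketch assembles a dictionary with the seven top-level keys and serializes it to JSON. The top-level key names and the ‘Manihotv6.1’ value come from the text. Everything else is an assumption made for the example: the function name, the inner marker key names (‘chrom’, ‘pos’, ‘ref’, ‘alt’, ‘qual’, ‘filter’, ‘info’, ‘format’, ‘name’), the species name, the header line, and the marker record itself.

```python
import json

ALLOWED_UNIT_TYPES = ("accession", "tissue_sample")


def build_protocol_prop(reference_genome_name, species_name,
                        header_information_lines,
                        sample_observation_unit_type_name, vcf_records):
    """Assemble a protocol-level object with the seven top-level keys."""
    if sample_observation_unit_type_name not in ALLOWED_UNIT_TYPES:
        raise ValueError("unit type must be 'accession' or 'tissue_sample'")

    marker_names, markers, markers_array = [], {}, []
    for rec in vcf_records:
        # Inner key names are hypothetical; they mirror the VCF columns.
        marker = {
            "name": rec["ID"],
            "chrom": rec["CHROM"],
            "pos": rec["POS"],
            "ref": rec["REF"],
            "alt": rec["ALT"],  # comma-separated alternate alleles
            "qual": rec["QUAL"],
            "filter": rec["FILTER"],
            "info": rec["INFO"],
            "format": rec["FORMAT"],
        }
        marker_names.append(marker["name"])
        markers[marker["name"]] = marker  # lookup keyed by marker name
        markers_array.append(marker)      # same data as a flat array

    return {
        "reference_genome_name": reference_genome_name,
        "species_name": species_name,
        "header_information_lines": header_information_lines,
        "sample_observation_unit_type_name": sample_observation_unit_type_name,
        "marker_names": marker_names,
        "markers": markers,
        "markers_array": markers_array,
    }


# Illustrative input only; the marker record is made up.
example = build_protocol_prop(
    "Manihotv6.1",
    "Manihot esculenta",
    ["##fileformat=VCFv4.2"],
    "accession",
    [{"ID": "S1_1000", "CHROM": "1", "POS": "1000", "REF": "A",
      "ALT": "G,T", "QUAL": ".", "FILTER": "PASS", "INFO": "AF=0.5",
      "FORMAT": "GT:DP:GQ"}],
)
protocol_json = json.dumps(example)
```

Keeping both ‘markers’ and ‘markers_array’ duplicates the marker data, trading storage for two access patterns: direct lookup by marker name, and iteration over an array of marker objects.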

For each genotyped sample there is an entry in the genotypeprop table whose value is a JSON formatted object of objects. The top-level key is the marker name and the corresponding object value contains all genotype score information for that marker and sample. The genotype score information is stored as simple key-value pairs whose keys come directly from the format field in the VCF specification, e.g. GT, DP, and GQ. During upload of a VCF into Breedbase, the genotypeprop value is constructed by encoding all genotype score information for a sample into this JSON object of objects. During upload of an Intertek custom genotyping file, only the GT key is populated for each marker in the genotypeprop value.
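
The per-sample object can be sketched as follows. This is a minimal illustration rather than the Breedbase upload code: the function names, marker names, and score values are hypothetical, scores are kept as the raw strings found in the file, and the encoding of the Intertek call is passed through unchanged because the text does not specify it.

```python
import json


def encode_scores(format_field, sample_field):
    """Pair VCF FORMAT keys (e.g. 'GT:DP:GQ') with one sample's values."""
    keys = format_field.split(":")
    values = sample_field.split(":")
    # zip() stops at the shorter list, so omitted trailing values are skipped.
    return dict(zip(keys, values))


def genotypeprop_from_vcf(rows):
    """rows: iterable of (marker_name, format_field, sample_field) tuples."""
    return {marker: encode_scores(fmt, sample) for marker, fmt, sample in rows}


def genotypeprop_from_intertek(calls):
    """calls: mapping of marker name to genotype call; only GT is populated."""
    return {marker: {"GT": call} for marker, call in calls.items()}


# Illustrative values only.
vcf_value = genotypeprop_from_vcf([
    ("S1_1000", "GT:DP:GQ", "0/1:12:99"),
    ("S1_2000", "GT:DP:GQ", "1/1:8:45"),
])
intertek_value = genotypeprop_from_intertek({"S1_1000": "0/1"})

# Serialized form, as it might be stored in the genotypeprop value.
vcf_json = json.dumps(vcf_value)
```

Because the top-level keys are marker names, the scores for one marker in one sample can be read with a single key lookup once the value is decoded.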
