Software for the identification of Genomic Safe Harbors (GSHs) in eukaryotic organisms
Software with a graphical user interface, written in Python 3.9 using standard libraries, which simplifies installation and allows it to run on multiple platforms, such as Linux and Windows.
The input consists of (1) the genome annotation file of the target organism, in the GFF3 format (Generic Feature Format); and (2) the Features and Track files in JSON format.
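As an illustration of the first input, the following is a minimal sketch of reading gene records from a GFF3 file using only the standard library. It assumes the nine tab-separated columns defined by the GFF3 format. The `Feature` tuple, the `parse_gff3` function and the sample lines are hypothetical and are not taken from SHIP's source code.

```python
from typing import Iterable, Iterator, NamedTuple


class Feature(NamedTuple):
    seqid: str
    type: str
    start: int
    end: int
    strand: str


def parse_gff3(lines: Iterable[str],
               wanted_types=frozenset({"gene"})) -> Iterator[Feature]:
    """Yield the features whose type (column 3) is in wanted_types."""
    for line in lines:
        line = line.rstrip("\n")
        # Skip blank lines, directives ("##...") and comments ("#...").
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 9:
            continue  # not a well-formed GFF3 record
        seqid, _source, ftype, start, end, _score, strand, _phase, _attrs = cols
        if ftype in wanted_types:
            yield Feature(seqid, ftype, int(start), int(end), strand)


# Made-up sample records, for illustration only.
sample = [
    "##gff-version 3",
    "chrI\texample\tgene\t335\t649\t.\t+\t.\tID=gene:g1",
    "chrI\texample\tmRNA\t335\t649\t.\t+\t.\tID=transcript:g1_mRNA",
    "chrI\texample\tgene\t1807\t2169\t.\t-\t.\tID=gene:g2",
]
genes = list(parse_gff3(sample))
for gene in genes:
    print(gene)
```

For a real file, the same function could be fed an open file handle, for example `parse_gff3(open(path, encoding="utf-8"))`.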
To identify GSHs based on biological premises, a set of annotation attributes was selected to direct the search toward regions of a minimum size that are devoid of genes and that do not harbor regulatory elements, such as transcription factor binding sites, DNA hypermethylation, and chromatin state and histone alterations. These elements are evaluated using information collected from external databases through the Entrez Programming Utilities (E-utilities) Application Programming Interface (API) of the National Center for Biotechnology Information (NCBI), the Ensembl REST API and the UCSC Genome Browser.
The software performs a combinatorial analysis to identify intergenic regions and the orientation of their flanking genes. Subsequently, intergenic regions that contain any of the genetic elements listed in tracks.json are excluded.
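The identification of intergenic regions and the orientation of their flanking genes can be sketched as follows. This is a simplified illustration, not SHIP's actual implementation: the `Gene` tuple and both functions are hypothetical, genes are assumed to lie on a single chromosome with 1-based inclusive coordinates, and a gene completely contained in the previous one is skipped.

```python
from typing import List, NamedTuple, Tuple


class Gene(NamedTuple):
    name: str
    start: int
    end: int
    strand: str  # "+" or "-"


def classify_pair(left_strand: str, right_strand: str) -> str:
    """Name the orientation of two neighbouring genes."""
    if left_strand == "+" and right_strand == "-":
        return "convergent"  # (+ -)
    if left_strand == "-" and right_strand == "+":
        return "divergent"   # (- +)
    return "tandem"          # (+ +) or (- -)


def intergenic_intervals(genes: List[Gene]) -> List[Tuple[str, str, str, int]]:
    """Return (left gene, right gene, orientation, gap in bp) for each
    pair of neighbouring genes on one chromosome."""
    ordered = sorted(genes, key=lambda g: (g.start, g.end))
    if not ordered:
        return []
    intervals = []
    left = ordered[0]
    for gene in ordered[1:]:
        if gene.end <= left.end:
            continue  # completely overlapped by the previous gene
        gap = gene.start - left.end - 1  # zero or negative: no free bases
        intervals.append(
            (left.name, gene.name, classify_pair(left.strand, gene.strand), gap)
        )
        left = gene
    return intervals


# Made-up genes, for illustration only.
example = [
    Gene("a", 100, 200, "+"),
    Gene("b", 150, 180, "-"),    # nested inside "a", so it is skipped
    Gene("c", 1801, 2500, "-"),
    Gene("d", 3000, 3500, "+"),
]
result = intergenic_intervals(example)
for row in result:
    print(row)
```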
- Python 3.9
- Packages (version): bio (1.5.9); biopython (1.81); biothings-client (0.3.0); certifi (2023.5.7); charset-normalizer (3.1.0); contourpy (1.0.7); cycler (0.11.0); fonttools (4.39.4); gprofiler-official (1.0.0); idna (3.4); kiwisolver (1.4.4); matplotlib (3.7.1); mygene (3.2.2); numpy (1.24.3); packaging (23.1); pandas (2.0.2); Pillow (9.5.0); platformdirs (3.5.1); pooch (1.7.0); pyparsing (3.0.9); python-dateutil (2.8.2); pytz (2023.3); requests (2.31.0); six (1.16.0); tqdm (4.65.0); tzdata (2023.3); urllib3 (2.0.3).
To use the program, you must have an NCBI account, as required by the Entrez Database API Usage Guidelines and Requirements. The program asks for the email address registered to this account.
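To show where the email address ends up, here is a minimal sketch that builds an E-utilities EFetch request URL for a sub-sequence of a nucleotide record. It only constructs the URL and makes no network call. The function name, the `tool` value, the coordinates and the email address are placeholders chosen for illustration, and how SHIP itself issues its requests may differ.

```python
from urllib.parse import urlencode

EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_efetch_url(accession: str, start: int, stop: int,
                     email: str, tool: str = "ship_example") -> str:
    """Build an EFetch URL that requests a FASTA sub-sequence."""
    params = {
        "db": "nuccore",       # nucleotide database
        "id": accession,
        "seq_start": start,    # first base to retrieve
        "seq_stop": stop,      # last base to retrieve
        "rettype": "fasta",
        "retmode": "text",
        "tool": tool,          # placeholder tool name
        "email": email,        # contact address that NCBI asks for
    }
    return f"{EFETCH_URL}?{urlencode(params)}"


# Placeholder coordinates and address, for illustration only.
url = build_efetch_url("BK006935.2", 2170, 3500, "user@example.org")
print(url)
```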
Before running the program, you will need all of the following files in the same directory:
- The genome annotation file of the target organism, in GFF or GFF3 format (e.g. Saccharomyces_cerevisiae.R64-1-1.49.gff3).
  - For a multicellular organism, you can also add a GFF file with the Regulatory Features of each cell type (e.g. homo_sapiens-regulatory_features.gff).
- The Feature list, in JSON format, that SHIP will use to analyze the GFF file (e.g. features.json).
  - At the beginning, the software will show you all the features present in the GFF file.
- The Track list, in JSON format, that SHIP will use to check each intergenic region against the UCSC Genome Browser database (e.g. tracks.json).
Warning
Do not change the name of the JSON files.
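A quick way to confirm that the inputs are in place before starting is sketched below. The file names features.json and tracks.json come from this README, while the `missing_inputs` helper and the example path are hypothetical and are not part of SHIP.

```python
from pathlib import Path
from typing import List

REQUIRED_JSON = ("features.json", "tracks.json")


def missing_inputs(gff_path: str) -> List[str]:
    """Return the names of required files that are not present.

    The JSON files are expected in the same directory as the GFF file.
    """
    gff = Path(gff_path)
    missing = []
    if not gff.is_file():
        missing.append(gff.name)
    for name in REQUIRED_JSON:
        if not (gff.parent / name).is_file():
            missing.append(name)
    return missing


# Example call with a placeholder path.
print(missing_inputs("target_organism.gff3"))
```

An empty list means that the annotation file and both JSON files were found.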
While the program is running, the user chooses the following key parameters:
- Neighbour gene orientation
  - Convergent (+ -)
  - Divergent (- +)
  - Tandem (+ + / - -)
- Intergenic size range
  - Maximum size of the intergenic region (Standard 2000)
  - Minimum size of the intergenic region (Standard 1500)
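The effect of these two parameters can be sketched as a simple filter. The record layout (dictionaries with `orientation` and `size` keys), the function name and the sample regions are assumptions made for this example, and the size bounds are treated as inclusive, which may not match SHIP's behaviour exactly.

```python
from typing import Dict, List


def select_candidates(intervals: List[Dict], orientation: str,
                      min_size: int = 1500, max_size: int = 2000) -> List[Dict]:
    """Keep intervals with the chosen flanking-gene orientation whose
    size (in bp) lies within [min_size, max_size]."""
    return [
        iv for iv in intervals
        if iv["orientation"] == orientation
        and min_size <= iv["size"] <= max_size
    ]


# Made-up intergenic regions, for illustration only.
intervals = [
    {"id": "region_1", "orientation": "convergent", "size": 1600},
    {"id": "region_2", "orientation": "convergent", "size": 2100},
    {"id": "region_3", "orientation": "tandem", "size": 1800},
    {"id": "region_4", "orientation": "convergent", "size": 1350},
]

default_hits = select_candidates(intervals, "convergent")
wider_hits = select_candidates(intervals, "convergent",
                               min_size=1300, max_size=2200)
print([iv["id"] for iv in default_hits])
print([iv["id"] for iv in wider_hits])
```

With the standard range only region_1 is kept, while widening the range to 1300-2200 also keeps region_2 and region_4.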
The user also needs to indicate which analyses should be performed:
- UCSC Database
- Ensembl (Cross References)
- Regulatory analysis (Multicellular organisms)
All files will be generated in the same directory as the input files.
- pGSH List
  - Description of neighboring genes
  - FASTA sequence of the intergenic region
  - Description of regulatory features
  - Additional information for each neighboring gene
Homo sapiens (with Regulatory Analysis)
Practical example running the software
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Welcome on Board Sailor!
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Enter your email to access NCBI: [email protected]
Enter file name <include the path>: /Users/user1/Documents/target_organism.gff3
Enter the interval for genomic analysis (default 500): (enter)
------------------------------
ANALYZING ...
------------------------------
Performing the following steps:
[1] Selecting and sorting the genes by their start within each chromosome.
[2] Disregarding completely overlapping genes.
[3] Filtering out convergent, divergent and tandem genes.
------------------------------
features : ['gene', 'ncRNA_gene', 'pseudogene', 'centromere', 'telomere', 'long_terminal_repeat', 'mobile_genetic_element', 'origin_of_replication', 'transposable_element_gene', 'meiotic_recombination_region', 'sequence_feature']
------------------------------
BK006935.2, BK006935.2 BK006936.2, BK006936.2 BK006937.2, BK006937.2 BK006938.2, BK006938.2 BK006942.2,
BK006942.2 AJ011856.1, AJ011856.1 BK006939.2, BK006939.2 BK006940.2, BK006940.2 BK006941.2, BK006941.2 BK006934.2,
BK006934.2 BK006943.2, BK006943.2 BK006944.2, BK006944.2 BK006945.2, BK006945.2 BK006946.2, BK006946.2 BK006947.3,
BK006947.3 BK006948.2, BK006948.2 BK006949.2, BK006949.2
---------- Types that will be processed ----------
gene = 6600 transposable_element_gene = 91
ncRNA_gene = 424 pseudogene = 12
-------- Types that will not be processed --------
chromosome = 17 CDS = 6913 snoRNA = 77 snRNA = 6
mRNA = 6600 ncRNA = 18 transposable_element = 91 five_prime_UTR = 4
exon = 7507 tRNA = 299 pseudogenic_transcript = 12 rRNA = 24
--------------------------------------------------
Interval (bp) Tandem Divergent Convergent
Overlapping 142 264 184
1-500 bp 2077 842 1445
501-1,000 bp 698 518 154
1,001-1,500 bp 169 124 46
1,501-2,000 bp 74 56 10
2,001-2,500 bp 26 22 4
2,501-3,000 bp 11 13 1
3,001-3,500 bp 6 7 1
3,501-4,000 bp 8 4 1
Quantity of Chromosomes = 17
Number of genes located = 7127
Total intergenic intervals = 6925
Total Intervals flanked by Tandem genes = 3218
Total Intervals flanked by Divergent genes = 1855
Total Intervals flanked by Convergent genes = 1852
Number of complete overlaps between genes = 191
Would you like to plot the Genes chart? (Y/N): Y
Types of flanking genes:
[1] CONVERGENT (+ -)
[2] DIVERGENT (- +)
[3] TANDEM (+ +/- -)
> Enter your choice: 1
> Minimum size of the intergenic region (Standard 1500): 1300
> Maximum size of the intergenic region (Standard 2000): 2200
Do you want to perform analysis on the UCSC Database? (Y/N) Y
Do you want to perform Cross References? (Y/N) Y
Do you want to perform regulatory analysis? (Y/N) Y
Enter the species to be analyzed: _Target organism_
Enter the regulatory build Path and File name: /Users/user1/Documents/target_organism.Regulatory_Build.regulatory_features.gff
------------------------------
PROCESSING ...
------------------------------
[1] Selecting the intergenic regions with size between 1500 to 2000 bp.
[2] Searching for sequences at NCBI.
[3] Crossing information with other databases.
Wait for processing... Go have a cup of coffee.
☕︎︎ ☕︎︎ ☕︎︎ ☕︎︎ ☕︎︎ ☕︎︎ ☕︎︎ ☕︎︎
------------------------------
RESULT
------------------------------
File generated with the results: /Users/user1/Documents/target_organism_result_SHIP.txt
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Do you want to analyze another GFF file? ...
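The interval table printed in the session above groups intergenic regions into fixed-width size bins (500 bp by default) for each orientation. A minimal sketch of that kind of binning follows. The function names and sample data are hypothetical, and labelling every region with a size below 1 bp as "Overlapping" is an assumption made for this example.

```python
from collections import Counter
from typing import Iterable, Tuple


def bin_label(size: int, width: int = 500) -> str:
    """Return a label such as '501-1,000 bp' for an intergenic size."""
    if size < 1:
        return "Overlapping"  # assumption: no free bases between the genes
    low = ((size - 1) // width) * width + 1
    high = low + width - 1
    return f"{low:,}-{high:,} bp"


def tally(intervals: Iterable[Tuple[str, int]], width: int = 500) -> Counter:
    """Count intervals per (bin label, orientation) pair."""
    counts = Counter()
    for orientation, size in intervals:
        counts[(bin_label(size, width), orientation)] += 1
    return counts


# Made-up (orientation, size) pairs, for illustration only.
sample = [
    ("tandem", 320),
    ("tandem", 500),
    ("divergent", 501),
    ("convergent", 1600),
    ("convergent", -12),
]
counts = tally(sample)
for (label, orientation), n in sorted(counts.items()):
    print(label, orientation, n)
```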
Leitão et al. SHIP: A Computational Tool for the Systematic Identification of Genomic Safe Harbors for Stable Gene Insertion in Eukaryotes. In Review
- MIT License
- Copyright (c) 2023 Matheus de Castro Leitão
- Software registered under process number BR512023002017-6