diff --git a/README.md b/README.md index d0fd968..f477528 100644 --- a/README.md +++ b/README.md @@ -1,14 +1,14 @@ ## About -This site presents How-to Guides and other associated documentation that supports the reuse of bioinformatics tools, workflows and data on Australian compute systems and infrastructure. +This site presents a collection of step-by-step guides that support the reuse of Galaxy workflows created in collaboration with the Bioplatforms Australia Threatened Species Initiative (TSI). There are guides for genome assembly and quality control, RAD-seq analysis with Stacks, and more! +Other guides are available at the BioCommons [How-to Hub](https://australianbiocommons.github.io/how-to-hub), the central location for all guides and associated documents that have been prepared by community members who gather around BioCommons activities. -## Article template - -See [`guide_template.md`](./about/guide_template.md) ## Acknowledgements -This work is supported by the [Australian BioCommons](https://www.biocommons.org.au/) via funding from [Bioplatforms Australia](https://bioplatforms.com/), the Australian Research Data Commons (https://doi.org/10.47486/PL105) and the Queensland Government RICF programme. Bioplatforms Australia and the Australian Research Data Commons are funded by the National Collaborative Research Infrastructure Strategy (NCRIS). +This work is supported by the [Australian BioCommons](https://www.biocommons.org.au/) via funding from [Bioplatforms Australia](https://bioplatforms.com/) and the Queensland Government RICF programme. Bioplatforms Australia is funded by the National Collaborative Research Infrastructure Strategy (NCRIS). 
+ +These guides were first developed as part of the [Australian BioCommons BYOD Expansion Project](https://www.biocommons.org.au/byo-data-platform-expansion), which was funded through NCRIS investments from Bioplatforms Australia and the Australian Research Data Commons ([http://doi.org/10.47486/PL105](http://doi.org/10.47486/PL105)) that were matched by co-investments from AARNet, Melbourne Bioinformatics, NCI, Pawsey, QCIF via the Queensland Government RICF fund, The University of Sydney, AGRF, Griffith University and Monash University. This repository makes use of the ELIXIR toolkit theme: [![theme badge](https://img.shields.io/badge/ELIXIR%20toolkit%20theme-jekyll-blue?color=0d6efd)](https://github.com/ELIXIR-Belgium/elixir-toolkit-theme) diff --git a/genome_annotation/Fgenesh.md b/genome_annotation/Fgenesh.md index 33e55d0..cb7e067 100644 --- a/genome_annotation/Fgenesh.md +++ b/genome_annotation/Fgenesh.md @@ -6,7 +6,7 @@ description: How-to Guide for genome annotation with FgenesH++. affiliations: [University of Sydney, Australian BioCommons, Bioplatforms Australia, Galaxy Australia, Threatened Species Initiative] --- -[Galaxy Australia](https://usegalaxy.org.au/) is capable of conducting genome annotation using the FgenesH++ annotation tool. +[Galaxy Australia](https://usegalaxy.org.au/) is capable of conducting genome annotation using the FgenesH++ annotation tool. Users need to apply for access to this tool; please see the [service notes here](https://www.biocommons.org.au/fgenesh-plus-plus) and apply for access [here](https://site.usegalaxy.org.au/request/access/fgenesh). This How-to-Guide will describe the steps required to annotate your genome on the Galaxy Australia platform (see **Fig 1**), developed in consultations between the Bioplatforms Australia [Threatened Species Initiative](https://threatenedspeciesinitiative.com/), [Galaxy Australia](https://usegalaxy.org.au/), and the [Australian BioCommons](https://www.biocommons.org.au/). 
@@ -16,22 +16,23 @@ If you need help, the Galaxy community is both approachable and helpful. [Ask th ## Quick start guide 1. [Login to Galaxy Australia](#register-and-login) -2. Create a new history -3. Upload your `assembled reference genome`, `repeat masked reference genome`, `.cdna`, `.pro` and `.dat` files from the [transcriptome workflow](Transcriptome) -4. Load and execute workflows, using required options +2. [Apply for access to FGenesH++](https://site.usegalaxy.org.au/request/access/fgenesh). +3. Create a new history +4. Upload your `assembled reference genome`, `repeat masked reference genome`, `.cdna`, `.pro` and `.dat` files from the [transcriptome workflow](Transcriptome) +5. Load and execute workflows, using required options - [Open `FgenesH++ genome annotation` workflow](https://usegalaxy.org.au/workflows/trs_import?trs_server=workflowhub.eu&run_form=true&trs_id=881) -5. Review workflow report and perform additional QC as needed -6. Re-run workflows, or individual tools, as needed +6. Review workflow report and perform additional QC as needed +7. Re-run workflows, or individual tools, as needed ## How to cite the workflow -> Silver, L. (2024). Fgenesh annotation -TSI. WorkflowHub. https://doi.org/10.48546/WORKFLOWHUB.WORKFLOW.881.1 +> Silver, L. (2024). Fgenesh annotation -TSI. WorkflowHub. https://doi.org/10.48546/WORKFLOWHUB.WORKFLOW.881.4 ## The overall workflow -{% include image.html file="/genome_annotation/Fig1.png" caption="Fig 1. The approach described in this How-to-Guide, including Quick Start guide steps 1) registration, 2) upload of input files, 3) FgenesH++ genome annotation Required workflow steps are blue, and optional steps are red." max-width="10" %} +{% include image.html file="/genome_annotation/Fig1-updated.png" caption="Fig 1. 
The approach described in this How-to-Guide, including Quick Start guide steps 1) registration, 2) upload of input files, 3) FgenesH++ genome annotation. Required workflow steps are blue, and optional steps are red." max-width="10" %} Further to this, a summary of the different elements of this assembly approach are detailed below: @@ -63,7 +64,7 @@ Further to this, a summary of the different elements of this assembly approach a {:start="3"} -3. Upload your assembled reference genome and masked reference genome (Link to repeat masking workflow), as well as the `.cdna`, `.pro` and `.dat` output from your [transcriptome assembly](Transcriptome) +3. Upload your assembled reference genome and masked reference genome (Link to repeat masking workflow), as well as the `.cdna`, `.pro` and `.dat` output from your [transcriptome assembly](Transcriptome). Note: Softberry recommends that the genome be hard-masked rather than soft-masked. ### Run the annotation workflow @@ -79,10 +80,10 @@ Further to this, a summary of the different elements of this assembly approach a {:start="4"} 4. The workflow invocation window will open. -5. Select your reference genome fasta file (Step 1 in Fig 5), -6. Select your repeat masked reference genome fasta file (Step 2 in Fig 5). +5. Select your assembled reference genome fasta file (Fig 5). +6. Select your repeat masked reference genome fasta file (Fig 5). -{% include image.html file="/genome_annotation/Fig5.png" caption="Fig 5." max-width="10" %} +{% include image.html file="/genome_annotation/Fig5-updated.png" caption="Fig 5." max-width="10" %} {:start="7"} diff --git a/genome_annotation/Transcriptome.md b/genome_annotation/Transcriptome.md index 04abe5a..ccf6df5 100644 --- a/genome_annotation/Transcriptome.md +++ b/genome_annotation/Transcriptome.md @@ -31,7 +31,7 @@ If you need help, the Galaxy community is both approachable and helpful. [Ask th ## How to cite the workflows -> Silver, L., & Syme, A. (2024). Repeat masking - TSI. 
WorkflowHub. https://doi.org/10.48546/WORKFLOWHUB.WORKFLOW.875.2 +> Silver, L., & Syme, A. (2024). Repeat masking - TSI. WorkflowHub. https://doi.org/10.48546/WORKFLOWHUB.WORKFLOW.875.3 > Silver, L., & Syme, A. (2024). QC and trimming of RNAseq reads - TSI. WorkflowHub. https://doi.org/10.48546/WORKFLOWHUB.WORKFLOW.876.1 @@ -53,10 +53,10 @@ Further to this, a summary of the different elements of this alignment approach | Process name | Workflow name | Description | Inputs | Outputs | | ---------------- | ----------------------------------------- | ------------------------------------------------------------------------------------ | ------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------- | | UPLOAD FILES | Not applicable | See the [different upload options](#upload-data-files). | reference genome, Fastq mRNA | Uploaded data! | -| Repeat Masking | [Repeat masking - TSI](https://usegalaxy.org.au/workflows/trs_import?trs_server=workflowhub.eu&run_form=true&trs_id=875) | Repeat masking of reference genome | Reference genome | FASTA file, Statistic file +| Repeat Masking | [Repeat masking - TSI](https://usegalaxy.org.au/workflows/trs_import?trs_server=workflowhub.eu&run_form=true&trs_id=875) | Repeat masking of reference genome | Reference genome | FASTA files of hard-masked and soft-masked genomes, Statistic file | RNA seq QC and trimming| [QC and trimming of RNAseq reads -TSI](https://usegalaxy.org.au/workflows/trs_import?trs_server=workflowhub.eu&run_form=true&trs_id=876) | Trimming of fastq files, including a fastqc step | Raw mRNA sequencing files | FASTQC report, Paired read FASTQ file | -| Align reads to find transcripts | [Find transcripts - TSI](https://usegalaxy.org.au/workflows/trs_import?trs_server=workflowhub.eu&run_form=true&trs_id=877) | Alignment of trimmed FASTQ reads to masked reference genome | Repeat masked reference genome, 
paired trimmed FASTQ reads | BAM file, GTF file alignment metrics| -| Combine Transcripts | [Combine Transcripts - TSI](https://usegalaxy.org.au/workflows/trs_import?trs_server=workflowhub.eu&run_form=true&trs_id=878) | Merges individual tissue transcripts to a global transcriptome and predicts coding sequences |GTF file, closely related species coding and non-coding sequences | GTF for global transcriptome, FASTA sequences of coding transcripts | +| Align reads to find transcripts | [Find transcripts - TSI](https://usegalaxy.org.au/workflows/trs_import?trs_server=workflowhub.eu&run_form=true&trs_id=877) | Alignment of trimmed FASTQ reads to masked reference genome | (soft) repeat masked reference genome, paired trimmed FASTQ reads | BAM file, GTF file alignment metrics| +| Combine Transcripts | [Combine Transcripts - TSI](https://usegalaxy.org.au/workflows/trs_import?trs_server=workflowhub.eu&run_form=true&trs_id=878) | Merges individual tissue transcripts to a global transcriptome and predicts coding sequences |GTF file, soft-masked genome, closely related species coding and non-coding sequences | GTF for global transcriptome, FASTA sequences of coding transcripts | | Extract Longest Transcripts | [Extract Transcripts-TSI](https://usegalaxy.org.au/workflows/trs_import?trs_server=workflowhub.eu&run_form=true&trs_id=879) | Transdecoder predictions and filtering of transcripts | FASTA sequence of coding transcripts | pep.fasta, cds.fasta and gff3 file of longest isoform transcripts | | Convert Outputs | [Convert formats - TSI](https://usegalaxy.org.au/workflows/trs_import?trs_server=workflowhub.eu&run_form=true&trs_id=880) | Converts outputs of transcdecoder to required inputs for FGenesH++ annotation | transdecoder-peptides.fasta, global_nucleotides.fasta |.cdna, .dat and .pro files | @@ -116,7 +116,7 @@ Further to this, a summary of the different elements of this alignment approach - Retrieve the workflows for `Align reads to find transcripts` - Import into your 
Galaxy Australia workflows 2. Once you have reached the workflow screen, select the ```play``` button for Align reads to find transcripts (Fig 8) -3. Select the paired forward and paired reverse trimmed reads and masked reference genome as input (Fig9), ensure you select files tagged with `#fastq_out_r1_paired` and `#fastq_out_r2_paired` +3. Select the paired forward and paired reverse trimmed reads and soft-masked reference genome as input (Fig9), ensure you select files tagged with `#fastq_out_r1_paired` and `#fastq_out_r2_paired` 4. Check the mapping summary file for each tissue to make sure there are high mapping rates to the genome 5. Make a dataset collection containing gtf files for all tissue transcriptomes @@ -137,6 +137,9 @@ Further to this, a summary of the different elements of this alignment approach {:start="3"} 3. Search for your species on [NCBI Taxonomy](https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi) to find the most closely related species which has an NCBI RefSeq annotation (Fig 11) + +{% include image.html file="/transcriptome/Fig11.png" caption="Fig 11." max-width="10" %} + 4. Go to the NCBI ftp server and locate the entry for this species (e.g. Corroborree frog RefSeq entry is GCF_028390025.1 and ftp entry is https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/028/390/025/) 5. Download the `_cds_from_genomic.fna.gz` and `pseudo_without_product.fna.gz` files to your local computer and upload into Galaxy (Fig 12) @@ -144,7 +147,7 @@ Further to this, a summary of the different elements of this alignment approach {:start="6"} -6. Select the gtf collection, masked reference genome, coding sequences and pseudo coding sequences in the combine transcripts workflow +6. Select the gtf collection, soft-masked reference genome, coding sequences and pseudo coding sequences in the combine transcripts workflow 7. 
In Step 7 of the workflow ensure the masked genome is selected and that in Step 10 of the workflow type "1" in the `List of Fields` box (Fig 13; Fig 14; Fig 15) {% include image.html file="/transcriptome/Fig13.png" caption="Fig 13." max-width="10" %} @@ -186,13 +189,11 @@ Further to this, a summary of the different elements of this alignment approach {:start="3"} -3. Select the transdecoder peptide fasta file and the text transformed fasta output file from the Combine Transcripts workflow (Fig 19; Fig20) - -{% include image.html file="/transcriptome/Fig19.png" caption="Fig 19." max-width="10" %} +3. Select the transdecoder peptide fasta file and the text transformed fasta output file from the Combine Transcripts workflow (Fig 19) -{% include image.html file="/transcriptome/Fig20.png" caption="Fig 20." max-width="10" %} +{% include image.html file="/transcriptome/Fig19_updated.png" caption="Fig 19." max-width="10" %} {:start="4"} -4. The output files tagged with `#dat`, `#pro`, and `#cdna`, along with the masked and unmasked reference genome are used as input files for [FGenesH++ genome annotation](Fgenesh) +4. The output files tagged with `#dat`, `#pro`, and `#cdna`, along with the hard-masked and unmasked reference genome are used as input files for [FGenesH++ genome annotation](Fgenesh). 
diff --git a/genome_assembly/hic-scaffolding.md b/genome_assembly/hic-scaffolding.md new file mode 100644 index 0000000..c5c1e04 --- /dev/null +++ b/genome_assembly/hic-scaffolding.md @@ -0,0 +1,219 @@ +--- +title: Genome scaffolding with Hi-C on Galaxy Australia +type: genome-assembly +contributors: [Luke Silver, Anna Syme] +description: This guide describes the steps required to scaffold your genome on the Galaxy Australia platform using HiC data +affiliations: [Bioplatforms Australia, Galaxy Australia, Australian BioCommons, Threatened Species Initiative] +--- + +This How-to-Guide will describe the steps required to scaffold your genome on the Galaxy Australia platform using a HiC scaffolding workflow developed by the [Vertebrate Genomes Project](https://vertebrategenomesproject.org/), and modified by the [Galaxy Australia](https://usegalaxy.org.au/) team in consultations with the Bioplatforms Australia [Threatened Species Initiative](https://threatenedspeciesinitiative.com/) and the [Australian BioCommons](https://www.biocommons.org.au/). + +This workflow has been created from a Vertebrate Genomes Project (VGP) scaffolding workflow. +* For more information about the VGP project see the [Galaxy-VGP project page](https://galaxyproject.org/projects/vgp). +* The VGP scaffolding workflow is hosted at [WorkflowHub](https://workflowhub.eu/workflows/625). +* Some minor changes have been made to better fit with TSI project data: optional inputs of SAK info and sequence graph have been removed; the required input format for the genome is changed from gfa to fasta; and the estimated genome size now requires user input rather than being extracted from output of a previous workflow. + +Please see the HiC Scaffolding section in the [VGP assembly tutorial](https://training.galaxyproject.org/training-material/topics/assembly/tutorials/vgp_genome_assembly/tutorial.html) for additional information about this workflow. 
+ +Note: If you initially assembled the genome with HiFi data only, and you have new HiC data, you may wish to consider re-assembling the genome with the [VGP HiFi-HiC assembly pipeline](https://workflowhub.eu/workflows/612) which can give better results than using HiFi data alone. + +{% include callout.html type="note" content="If you need help, the Galaxy community is both approachable and helpful. [Ask them questions!](https://help.galaxyproject.org/)" %} + +### Register and login + +1. To register for Galaxy Australia, visit the [login page](https://usegalaxy.org.au/login). +2. Click the ```Register here``` link. +3. Complete the registration wizard and click ```Create```. +4. Login to your account! + +### Upload data files + +Please see the how-to guide for HiFi genome assembly for additional information about uploading data from the Bioplatforms Australia Data Portal. + +* In Galaxy Australia, create a new history +* Import data + * ```assembly.fasta```. This file may be in one of your Galaxy histories. + * HiC data: concatenated ```HiC_F.fastqsanger.gz```, concatenated ```HiC_R.fastqsanger.gz``` + +### Import the scaffolding workflow + +Please see the how-to guide for HiFi genome assembly for additional information about how to import and run workflows. + +* [Visit this link](https://usegalaxy.org.au/workflows/trs_import?trs_server=workflowhub.eu&run_form=true&trs_id=1054) to: + - Retrieve the workflow for `TSI-Scaffolding-with-HiC` + - Import into your Galaxy Australia workflows + +### Run the workflow + +* Click on the Workflow tab, find this workflow and click on the triangle run button. +* Add in the required inputs: + * assembly.fasta + * restriction enzymes + * HiC forward and reverse reads - these need to be a single concatenated file for each set, and in fastqsanger.gz format + * Estimated genome size as integer + * Lineage for BUSCO +* Click Run + +### What the workflow does + +
+| Step | Inputs | Tool | Outputs |
+| --- | --- | --- | --- |
+| Map HiC reads to genome | assembled_genome.fasta, HiCR1.fastqsanger.gz, HiCR2.fastqsanger.gz | BWA MEM 2 | HiCR1.bam, HiCR2.bam |
+| Merge bams | HiCR1.bam, HiCR2.bam | Filter and merge | HiC.bam |
+| Make pre-scaffolding Pretext map | HiC.bam | Pretext map | Pretext map output |
+| Make pre-scaffolding Pretext map snapshot | Pretext map output | Pretext snapshot | HiC contact map (view) |
+| Scaffold | assembled_genome.fasta, HiC.bam | YAHS | scaffolded_assembly.fasta |
+| Map HiC reads to scaffold | scaffolded_assembly.fasta, HiCR1.fastqsanger.gz, HiCR2.fastqsanger.gz | BWA MEM 2 | HiCR1scaffold.bam, HiCR2scaffold.bam |
+| Merge bams | HiCR1scaffold.bam, HiCR2scaffold.bam | Filter and merge | HiCscaffold.bam |
+| Make post-scaffolding Pretext map | HiCscaffold.bam | Pretext map | Pretext map output scaffold |
+| Make post-scaffolding Pretext map snapshot | Pretext map output scaffold | Pretext snapshot | HiC contact map scaffold (view, compare to pre-scaffold map) |
+| | | | |