Skip to content

Latest commit

 

History

History
121 lines (85 loc) · 6.93 KB

README.md

File metadata and controls

121 lines (85 loc) · 6.93 KB

articSlurmy

Introduction

articSlurmy is a Bash based implementation of the ARTIC networks nCoV-2019 novel coronavirus bioinformatics protocol for Slurm clusters, with added downstream analysis and QC. It is not dependent on a workflow system such as Nextflow and submits Slurm job arrays for each ONT flowcell of output directly. articSlurmy was developed at Northumbria University Newcastle, and is used by NU-OMICS on their HPC facility for their COG-UK sequencing effort. A typical run of a GridION should only take about 10-30 minutes to analyse depending on depth of data produced.

Features

  • Submit script with input/output sanity checking
  • Upload ready output for COG-UK
  • Extensive QC with, global depth, N counts and amplicon level readouts, using mosdpeth and bedtools
  • Variant annotation with ANNOVAR
  • Lineage assignment with Pangolin

Installation

1) Setting up the conda environments

Clone this repo into your home dir:

git clone https://github.com/MattBashton/articSlurmy

Next setup the artic network conda environment as explained here:

git clone https://github.com/artic-network/artic-ncov2019.git
cd artic-ncov2019
conda env create -f environment.yml
cd

Tip: if resolving the environment is taking a long time Ctrl-C and try again setting:conda config --set channel_priority strict

Next activate this environment and install additional channels and dependencies:

conda activate artic-ncov2019
conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge
conda install pigz
conda install mosdepth
conda install bedtools

2) Installing ANNOVAR

Next we need to install ANNOVAR for variant annotation, which sadly is not available for easy conda based installation, via the registration form, articSlurmy will assume you have installed it to ~/annovar with the "SARS-CoV-2" annotation package provided by the author (see "2020Apr28" and "2020Jun08" updates on the main page) into ~/annovar/sarscov2db

3) Installing Pangolin

Finally pangolin (Phylogenetic Assignment of Named Global Outbreak LINeages) need to be installed into its own conda environment to provide lineage assignment:

conda deactivate
cd
git clone https://github.com/cov-lineages/pangolin
cd pangolin
conda env create -f environment.yml
conda activate pangolin
pip install .
conda deactivate

4) Check default setting are correct for your set-up

Finally you should edit/check some local config settings in artic.sh, the # Some defaults section here details a few paths and run settings related to amplicons and primers you might want to check, specifically you might want to change PRIMERS="nCoV-2019/V3" to PRIMERS="nCoV-2019/V4" depending on which version of the artic primers you are using. I find $TMPDIR is often not configured correctly on Slurm clusters so create my own temp dir using mktemp, however the -p prefix will need pointing at the right path on the worker nodes. e.g. /scratch, /tmp, or /local etc. depending on your local setup.

Usage

Before submitting your first run you will need to create a tab delimited .tsv file, format: barcode central_sample_id - this ensures output is in ready to upload format and created with correct central_sample_id prefix. This should look like:

barcode01	CENT-188D9E
barcode02	CENT-188F98
barcode03	CENT-18994E

Submit a run via:

submit_ARTIC_run.sh <tab delimited barcode dir - sample ID mapping> <input run dir> <sequencing_summary.txt> <output dir for analysis> <global run name>

e.g.:

submit_ARTIC_run.sh sample_sheet.tsv run_output/ run_output/sequencing_summary.txt analysis_output 2020-05-25_MACHINE-ID_CENT-0001_FLOWCELL-ID

Output

The output directory specified above will include two subdirs, i) processed/ which contains all output generated for all samples in the run, prefixed with their central_sample_ids, and ii) upload/ which within a global run name directory has subdirectories for each central_sample_id that contain alignment.bam and consensus.fa ready for submission to CLIMB if you are contributing to COG-UK:

upload/
└── 2020-05-25_MACHINE-ID_CENT-0001_FLOWCELL-ID/
    ├── CENT-188D9E
    │   ├── alignment.bam
    │   └── consensus.fa
    ├── CENT-188F98
    │   ├── alignment.bam
    │   └── consensus.fa
    ├── CENT-18994E
    │   ├── alignment.bam
    │   └── consensus.fa

The output directory will also contain artic.oJOBID.N and artic.eJOBID.N files in Sun Grid Engine style, the standard out files contained detailed QC data as outlined in Features above, in addition to files written to the processed/ directory.

Run stats report

Once a run is complete the getStats.sh script located in the output dir can be run to generate a run report which is also written to the file run_stats.tsv this contains details of total reads, depth, consensus Ns, variants called and lineage assignment per barcode/sample. This takes the form of:

central_sample_id  barcode    total_reads  post_guppyplex_reads  %reads_carried_into_analysis  aligned_reads  input_depth  ouput_depth  number_consensus_Ns  %N     variants_called  lineage
CENT-188D9E        barcode01  540824       532110                98.39                         520105         6723.14      401.12       124                  0.41   14               B.1.1
CENT-188F98        barcode02  179183       174903                97.61                         173812         2200.65      371.30       658                  2.20   3                B.3
CENT-18994E        barcode03  3026802      2837018               93.73                         2794672        35952.53     431.82       123                  0.41   11               B.1.56

In addition to the above per-amplicon coverage and N count stats are also given in the standard out files for each task/sample as well .tsv files in the processed/ directory which are prefixed with sample ID.

Similar work

The Connar lab ncov2019-artic-nf Nextflow pipeline has more or less similar functionality to this pipeline.