This pipeline provides efficient pre-processing and quality control of bulk RNA-sequencing (RNA-seq) data on high-performance computing clusters (HPCs) using the Torque/PBS scheduler, or on a single high-CPU, high-RAM machine. It is made available by the Data Analytics Core (DAC) of the Center for Quantitative Biology (CQB), located at Dartmouth College. Both single- and paired-end datasets are supported, as are library preparation methods for full-length or 3'-only analysis. The pipeline has been built and tested using human and mouse datasets. Required software can be installed with Conda using the environment file (environment.yml), or specified as paths in the config.yaml file.
The major steps implemented in the pipeline include:
- FASTQ quality control assessment using FastQC
- Read trimming for poly-A tails, specified adapters, and read quality using Cutadapt
- Alignment using HISAT2 or STAR
- Quantification with featureCounts and RSEM
All of these tools can be installed in a Conda environment or provided on paths available to the computing server. As input, the pipeline takes raw data in FASTQ format, and produces quantified read counts (using featureCounts or RSEM) as well as a detailed quality control report (including pre- and post-alignment QC metrics) for all processed samples. Quality control reports are aggregated into HTML files using MultiQC.
The pipeline uses Snakemake to submit jobs to the scheduler, or to spawn processes on a single machine, and requires several variables to be configured by the user before running the pipeline (an example config snippet follows the list):
- sample_tsv - A TSV file containing sample names and paths to FASTQ files. See the example in this repository for formatting.
- layout - Either "single" or "paired" library construction.
- aligner_name - 'hisat' or 'star'
- aligner_index - Path to the HISAT2 or STAR genome reference index
- picard_rrna_list - Absolute path to the coordinates of ribosomal RNA sequences in the reference genome, in interval-list format
- picard_refflat - Absolute path to the genome annotation in RefFlat format
- annotation_gtf - Absolute path to the genome annotation file (.gtf) used by featureCounts or RSEM
- picard_strand - "FIRST_READ_TRANSCRIPTION_STRAND" or "SECOND_READ_TRANSCRIPTION_STRAND"
- featurecounts_strand - "1" or "2" (1 for first read transcription strand, 2 for second)
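For orientation, a minimal config.yaml sketch is shown below; every path and value here is a hypothetical placeholder, so substitute your own references (or use one of the pre-built configs described later):

# Hypothetical config.yaml sketch; all paths below are placeholders
sample_tsv: sample_fastq_list.txt
layout: "paired"
aligner_name: "hisat"
aligner_index: /path/to/refs/hg38/hisat_index/genome            # placeholder
picard_rrna_list: /path/to/refs/hg38/hg38.rRNA.interval_list    # placeholder
picard_refflat: /path/to/refs/hg38/hg38.refFlat.txt             # placeholder
annotation_gtf: /path/to/refs/hg38/hg38.gtf                     # placeholder
picard_strand: "FIRST_READ_TRANSCRIPTION_STRAND"
featurecounts_strand: "1"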
Clone this repository:
git clone https://github.com/Dartmouth-Data-Analytics-Core/DAC-RNAseq-pipeline.git
cd DAC-RNAseq-pipeline
Activate an environment containing Snakemake:
conda activate /dartfs/rc/nosnapshots/G/GMBSR_refs/envs/snakemake
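The path above is specific to Dartmouth's Discovery/DartFS systems. If you are working elsewhere, you can build an equivalent environment from the repository's environment file instead; the environment name is taken from environment.yml, and 'rnaseq-pipeline' below is a hypothetical example:

# Create the environment from the repository's environment.yml
conda env create -f environment.yml
# Activate it by the name defined in environment.yml (hypothetical name shown)
conda activate rnaseq-pipeline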
Build, configure, and check reference files:
# INDEX SEQUENCE/ANNOTATION FILES FOR MAPPING
snakemake -s Snakefile --use-conda -j 6 --conda-prefix /dartfs/rc/nosnapshots/G/GMBSR_refs/envs/DAC-RNAseq-pipeline build_refs
# ADD REFERENCE DETAILS TO CONFIG FILE
cat ref/pipeline_refs/hg38_chr567_100k.entries.yaml >> config.yaml
# CHECK REFERENCE IS CORRECTLY FORMATTED
snakemake -s Snakefile --use-conda -j 6 --conda-prefix /dartfs/rc/nosnapshots/G/GMBSR_refs/envs/DAC-RNAseq-pipeline check_refs
Run the pipeline:
snakemake -s Snakefile --use-conda -j 6 --conda-prefix /dartfs/rc/nosnapshots/G/GMBSR_refs/envs/DAC-RNAseq-pipeline
The DAC has made public reference genomes, with their corresponding aligner indexes and annotation files, available to the Dartmouth community on Discovery/DartFS. Additional documentation on the public references can be found in their repository. Pre-built config.yaml files for this RNA-seq pipeline have also been added to the prebuilt_configs directory of this repository. As of 4/29/24, there are configs for any combination of human/mouse, single/paired reads, and HISAT2/STAR/RSEM. An example of using a pre-built config for human, paired-end reads with HISAT2 is as follows:
snakemake -s Snakefile --configfile prebuilt_configs/human_config_paired_hisat.yaml --use-conda -j 6 --conda-prefix /dartfs/rc/nosnapshots/G/GMBSR_refs/envs/DAC-RNAseq-pipeline
When using a pre-built config, you will still need to create a sample_fastq_list.txt for each specific run, and ensure this file is specified correctly at the top of the config file.
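The exact sample sheet columns are defined by the example file shipped in this repository; purely as a hypothetical illustration, a paired-end sheet might use a tab-separated layout like this:

# Hypothetical sample_fastq_list.txt layout (tab-separated); check the
# repository's example file for the authoritative column names and order
sample1	/path/to/fastq/sample1_R1.fastq.gz	/path/to/fastq/sample1_R2.fastq.gz
sample2	/path/to/fastq/sample2_R1.fastq.gz	/path/to/fastq/sample2_R2.fastq.gz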
Submit the pipeline to a single machine, allowing usage of 40 cores:
snakemake --use-conda -s Snakefile -j 40
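Before committing 40 cores, it can be useful to preview the jobs Snakemake would schedule; the standard -n/--dry-run flag prints the planned rules without executing them:

# Dry run: list the jobs that would be executed, without running anything
snakemake --use-conda -s Snakefile -j 40 -n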
Submit the pipeline to a computing cluster using the profile defined in cluster_profile/config.yaml, and allow jobs to be re-run twice in case of failure:
snakemake --use-conda -s Snakefile --profile cluster_profile -T 2
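The repository ships its own cluster_profile/config.yaml, which is the one actually used by the command above. Purely as a sketch of how a Snakemake profile maps command-line flags to YAML keys for a Torque/PBS scheduler (the qsub resource string is a hypothetical example), such a profile might contain:

# Hypothetical sketch of a Snakemake profile for Torque/PBS; keys are
# Snakemake long options without the leading dashes
cluster: "qsub -l nodes=1:ppn={threads}"
jobs: 50
use-conda: true
latency-wait: 60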
Contact & questions: Please address questions to [email protected] or submit an issue in the GitHub repository.
This pipeline was created with funds from the COBRE grant 1P20GM130454. If you use the pipeline in your own work, please acknowledge the pipeline by citing the grant number in your manuscript.