Processing of RNA-seq data using a customized workflow

Description

RNA-seq data in .fastq files were aligned to reference genome and transcriptome using the STAR (Spliced Transcripts Alignment to a Reference) program. STAR was run in the 2-pass mode. The first pass mapped reads to known splice junctions in reference transcriptome during alignment while allowing for detection of novel junction sites. Junction sites detected from multiple samples by the first pass were collected, combined and filtered. The second pass re-aligned reads to the reference genome and transcriptome, plus novel junction sites detected by the first pass. The alignment results were saved as indexed .bam files.

Aligned reads in .bam files were loaded into R using two custom functions.The first function (link) maps each aligned reads to known exons, transcripts, and genes. The second function (link) counts number of reads mapped to each gene in multiple categories. It first counts reads mapped to one and only one genes, and then reads mapped to multiple genes. If the libaries were made strand-specific, it will also count the number of reads mapped to the antisense strand of any genes. If the reads were paired, it will also count the number of reads with only one end mapped. Therefore, there are up to 8 categories of read counts, (unique/sense/paired, multiple/antisense/unpaired, etc.). These categories have different priorities and are mutually exclusive of each other. By default, reads with both ends mapped to the sense strand were counted first, followed by those with both ends mapped to the antisense strand, then by reads with only one end mapped. The count of unique/sense/paired reads is usually the final output for analysis. Finally, gene-level read counts were converted to FPKM (fragment per kilobase per million reads) to represent gene expression level.

Step by step

Prepare reference genome and transcriptome for STAR alignment:

Download newest version of STAR from https://github.com/alexdobin/STAR
Download reference genome (.fasta file) and transcriptome (.gtf/.gff file) from one of the following sources:
- iGenome: https://support.illumina.com/sequencing/sequencing_software/igenome.html
- NCBI Genome: http://www.ncbi.nlm.nih.gov/genome/51 (need to map chromosome names if download directly from NCBI)
- ENSEMBL: http://useast.ensembl.org/info/data/ftp/index.html?redirect=no
Generate genome index for STAR without using any annotation; example: https://raw.githubusercontent.com/zhezhangsh/MyMethods/master/examples/rnaseq/star/generate_ref.sh

# shell code: generate reference genome
/nas/is1/rnaseq_workspace/tools/STAR-2.5.1b/bin/Linux_x86_64_static/STAR \
--runThreadN 12 \
--runMode genomeGenerate \
--genomeDir /nas/is1/rnaseq_workspace/refs/mm38/star \
--genomeFastaFiles /nas/is1/rnaseq_workspace/refs/mm38/GCF_000001635.24_GRCm38.p4_genomic.fna

Align reads to references via STAR

Locate the full path of all fastq files
Download the yaml template to local folder

# shell code: download yaml template
wget https://raw.githubusercontent.com/zhezhangsh/Rnaseq/master/examples/RunStar/RunStar.yml RunStar.yml

Edit the yaml file for each data set, especially the following fields

# text editing: yaml fields
output: /home/zhangz/R/source/MyMethods/examples/rnaseq/star  # location of output files
junction:                                                     # options about combining novel junction sites from individual libraries
  combine: yes                                                  # combine junction sites ?
  filename: combined_SJ.out.tab                                 # file name of combined junctions
  canonical: no                                                 # include canonical sites only in the combined file?
  unannotated: yes                                              # include unannotated sites only in the combined file?
  minimum:                                                      # filtering strategies
    read: 3                                                       # the minimal number of reads mapped to the jucntion in each library
    overhang: 5                                                   # the minimal overhang of the junction in each library
    sample: 3                                                     # the minimal number of samples meeting the last 2 criteria
    total: 12                                                     # the total number of reads from all libraries mapped to the junction
qsub:                                                         # options to qsub alignment jobs to a cluster
  will: yes                                                     # will qsub jobs?
  prefix: qsub -cwd -l mem_free=32G -l h_vmem=64G -pe smp 16    # command prefix for qsub each alignment jobs
  path:                                                         # path change
    from: /nas/is1                                                # local path to the input files      
    to: /mnt/isilon/cbmi/variome                                  # cluster path for qsub
options:                                                      # other options
   - genomeDir                                                    # directory of indexed reference genome
   - sjdbGTFfile                                                  # full path to gene annotation gtf file   
fastq:                                                        # list of fastq files, name each library
  C2863:                                                        # name of library
    fastq1: /nas/is1/zhangz/projects/simmons/fastq/C2863_1.fq.gz
    fastq2: /nas/is1/zhangz/projects/simmons/fastq/C2863_2.fq.gz

Generate code to perfrom STAR alignment and other functions. Code for each pass includes:
A RunStar.sh file for each sample; example: https://raw.githubusercontent.com/zhezhangsh/MyMethods/master/examples/rnaseq/star/pass_1/STAR_C2863.sh
A qsub.sh file for qsub-ing all RunStar.sh files; example: https://raw.githubusercontent.com/zhezhangsh/MyMethods/master/examples/rnaseq/star/pass_1/qsub.sh
Extra script that delete temporary SAM files and merge junction sites; example: https://raw.githubusercontent.com/zhezhangsh/MyMethods/master/examples/rnaseq/star/pass_1/script/combined_junction.r and https://raw.githubusercontent.com/zhezhangsh/MyMethods/master/examples/rnaseq/star/pass_1/script/delete_sam.sh

# shell code: qsub the first pass alignment
sh <PATH>/pass_1/qsub.sh

# shell code: combine novel junction sites of individual libraries for the second pass alignment
Rscript <PATH>/pass_1/script/combine_junction.r

# shell code: remove temparory sam files from the first pass
sh delete_sam.sh

# shell code: qsub the second pass alignment
sh <PATH>/pass_2/qsub.sh

Load bam files to get gene-level read counts

Locate all indexed bam files
Download the yaml template to local folder

# shell code: download yaml template
wget https://raw.githubusercontent.com/zhezhangsh/MyMethods/master/examples/rnaseq/load_bam/LoadBamScript.yml LoadBamScript.yml

Edit the yaml file for each data set, especially the following fields

# text editing: yaml fields
output: /home/zhangz/R/source/MyMethods/examples/rnaseq/load_bam              # Location of output files
exon: /mnt/isilon/cbmi/variome/rnaseq_workspace/refs/mm38/GRCm38_exon.rds     # Exon annotation. A GRanges object, with the transcript_id and gene_id fields
paired: yes                                                                   # PE and SE reads
strand: -1.0                                                                  # 0 if libraries not stranded; -1 if using common protocol
split:                                                                        # split reads into subgroups by ID
  split: yes                                                                  
  separator: ':'                                                                 # separator of read ID syllabus
  level: 3                                                                       # number of syllabus in read ID to split subgroups
R: /home/zhangz/miniconda3/envs/zhangz/bin/Rscript                            # the location of R program on the cluster
qsub: qsub -cwd -l mem_free=64G -l h_vmem=64G -pe smp 16                      # qsub prefix; adjust mem_free corresponding to file size
region:                                                                       # specify chromosomes and locations
  NC_000067.6:
  - 1.0
  - 1.9547197e+08
  NC_000068.7:
  - 1.0
  - 1.8211322e+08
bam:                                                                          # sample name and bam file full path
  2863C-L1F: /mnt/isilon/cbmi/variome/zhangz/projects/simmons/2016-02_RNAseq/star/pass_2/2863C-L1F_Aligned.sortedByCoord.out.bam
  2863C-L1FL: /mnt/isilon/cbmi/variome/zhangz/projects/simmons/2016-02_RNAseq/star/pass_2/2863C-L1FL_Aligned.sortedByCoord.out.bam

Download the R script to generate qsub code, run the script, and qsub jobs; example: https://raw.githubusercontent.com/zhezhangsh/MyMethods/master/examples/rnaseq/load_bam/LoadBamScript.r and https://github.com/zhezhangsh/MyMethods/edit/master/examples/rnaseq/load_bam/qsub.sh

# shell code: download R script
wget https://github.com/zhezhangsh/MyMethods/edit/master/examples/rnaseq/load_bam/LoadBamScript.r LoadBamScript.r 

# shell code: generate qsub commands
Rscript LoadBamScript.r  # a qsub.sh file will be generated in the current folder

# shell code: qsub jobs to cluster
sh qsub.sh

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Process_RNAseq.md

Process_RNAseq.md

Processing of RNA-seq data using a customized workflow

Description

Step by step

Prepare reference genome and transcriptome for STAR alignment:

Align reads to references via STAR

Load bam files to get gene-level read counts

Files

Process_RNAseq.md

Latest commit

History

Process_RNAseq.md

File metadata and controls

Processing of RNA-seq data using a customized workflow

Description

Step by step

Prepare reference genome and transcriptome for STAR alignment:

Align reads to references via STAR

Load bam files to get gene-level read counts