forked from geparada/RNA_seq_course2020
-
Notifications
You must be signed in to change notification settings - Fork 2
/
Copy pathChapterIV_Pipelines_and_reproducible_bioinformatics.Rmd
174 lines (120 loc) · 8.76 KB
/
ChapterIV_Pipelines_and_reproducible_bioinformatics.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
---
title: "ChapterIV Pipelines and reproducible bioinformatics"
output: html_notebook
---
# 5. Introduction to reporducible bioinformatics
Until this point, we have introduced every command mannually in the terminal. Please check how many files do we have at `FASTQ/`. Do you think you can get the raw counts from all the samples without having errors? what about processing hundred of samples, is it reasonable to do it manually?
The answer is clearly *no*. We need systematic ways to process data to avoid errors and enable reproducibility in our analyses. This is why we are going to use a workflow manager to execute the remaining steps to get the raw counts from our samples.
Snakemake is currently one most popular workflow managers to work with bioinformatic software. To install snakemake please the following command:
`conda create -n snakemake_env snakemake `
By doing this we created a virtual environment, called `snakemake_env` that has latest version of snakemake and our it dependencies installed. Now to activate this environment, write:
`conda activate snakemake_env`
As we already wrote a `snakemake` pipeline for you, we are going to demonstrate you some of its properties and how to use it. To see the code from the this pipeline, Use `Atom` (or any text editor) to open the `Snakefile`. Can you recognize some of the steps that we have done already?. Every `rule`represent a step of the analysis. For example:
```{python}
rule hisat2_Genome_index:
input:
"Genome/dm6.fa"
output:
"Genome/Index/dm6.1.ht2"
threads: 7
conda:
"envs/core.yaml"
log:
"logs/hisat2_Genome_index.log"
shell:
"hisat2-build -p {threads} {input} Genome/Index/dm6 2> {log}"
```
This code correspond to the indexing step, in which it takes `Genome/dm6.fa` as input and `Genome/Index/dm6.1.ht2` as output. The rules have several `key` words by which different parts of the command are declared:
* input: set of intput files, in this case just "Genome/dm6.fa"
* output: set of output files. In this case, more output files are created, but they do not have to be pointed by the commands, we can just refer to one of the files that is created. Snakemake will check if this file is successfully created after the process is finished.
* threads: number of processors
* conda: the virtual environment in which the process will be run.
* log : file that store anything that hisat2 outputs while is creating the index.
* shell : This is the formula to create the shell command given all the parameters described above.
The indexing rule is directly connected to the following mapping rule:
```{python}
rule hisat2_to_Genome:
input:
fastq = "FASTQ/{sample}.fastq.gz",
genome = "Genome/Index/dm6.1.ht2"
output:
temp("hisat2/{sample}.sam") # Temporary output
threads: 3
conda:
"envs/core.yaml"
log:
"logs/hisat2_to_Genome.{sample}.log"
shell:
"hisat2 -p 3 -U {input.fastq} -x Genome/Index/dm6 > {output} 2> {log}"
```
This rule takes fastq files as input and also the genome index files. All the sample names were obtained from`NCBI_accession_list.txt` file, which contains the SRA accession codes corresponding to all the samples that we are analysing. Inside the rule `{sample}` takes the value of every accession code, and allow snakemake to generate all the mapping commands for every sample. As we here set `threads` as 3, every mapping process will use 3 processors, which means that 2 mapping processes can be run in parallel when 7 cores are provided.
The next rule `bamstats` take every SAM file and transform it to BAM, but also the BAM file is sorted and indexed at the same time:
```{python}
rule samTobam:
input:
"hisat2/{sample}.sam"
output:
"hisat2/{sample}.sorted.bam"
conda:
"envs/core.yaml"
shell:
"samtools view -b {input} | samtools sort - -o {output} && samtools index {output} "
```
Because SAM files were produced as temporary files (`temp("hisat2/{sample}.sam")` ),
as soon as `samTobam` finishes, SAM files are deleted. This optimizes the disk space, which is important when a large number of samples are processed.
Finally, all these steps converge at:
```{python}
rule featureCounts:
input:
gtf = "Gene_annotation/dm6.Ensembl.genes.gtf",
bam = expand("hisat2/{sample}.sorted.bam", sample=SAMPLES)
output:
"featureCounts/total_samples.gene_count.txt"
threads: 1
conda:
"envs/core.yaml"
log:
"logs/featureCounts.total.log"
shell:
"featureCounts -a {input.gtf} -o {output} {input.bam} 2> {log}"
```
Where `{input.gtf}` list all the sorted bam that we generated. Notice that this Snakefile starts with `include: "rules/00_download_data.skm" `, which is a statement that connects this Snakefile with `rules/00_download_data.skm`. This is a script that read all accession codes from `NCBI_accession_list.txt` and stores those at `SAMPLES`, which is then used by `expand("hisat2/{sample}.sorted.bam", sample=SAMPLES)` to generate the list of all sorted BAM files.
## 5.1 Quantifying all the samples at once
We have only qualified one sample so far, but now executing the Snakefie, we can process all the other samples in parallel. For this, we first are going to do a `dry-run` to check the list of commands that snakemake will run for us. On the command line (in our base folder, `00_Reproducible_RNA-Seq_Processing`), please write:
`snakemake -np featureCounts`
This command shows us all the steps that snakemake will until executing a rule named *featureCounts* (see Snakefile's code) which run `featureCounts` over all the samples. Where `-n` prevent snakemake from running the pipeline and `-p` prints the commands for each step.
**Warning**: Check if on job plan you find downloading processes. On this case, to avoid re-downloading data please update the time-stamp of the files located at `FASTQ`, `Genome` and `Gene_annotation` by running the following command.
`touch FASTQ/*.fastq.gz Genome/dm6.fa Gene_annotation/dm6.Ensembl.genes.*`
To visualise these steps run:
`snakemake featureCounts --dag | dot -Tpng > featureCounts.png`
This will produce the following image:
![](Images/featureCounts.png)
Which help us to understand the planned job execution.
Finally, to run these steps we need to enable snakemake to use the environment files are needed for each rule by including `--use-conda` and also we should limit the number of processors to 7 with `--cores 7`.
**Exercise 5.1.1**
A) Generate your own directed acyclic graph (DAG) and compare it with this manual. Are they identical? (hint: use scp transfer the image to your local computer so you can visualize it)
B) Execute the same command than before, but using `--rulegraph` instead of `--dag` and saving the image on a different file. What is the difference? Which one is more convenient to when you are working with a large number of samples?
**Exercise 5.1.2**
A) Execute `snakemake` to quantify all the samples with featureCounts using 6 cores and including the `--use-conda` flag to activate the automatic creation of virtual enviroments (hint: using a similar call than the fist command of section 5.1 - read the `snakemake --help`).
B) Inspect `core.yaml` file inside `env` folder. Do you recognise this software? Investigate in google how to create a conda environment from a `yaml` file and use your newly acquired knowledge to create a testing snakemake environment that has installed all the software described at `core.yaml`.
C) Look inside `Snakefile`code to see where the final output of featureCount will be stored and compare it with the output we previously had using featureCounts. What is the main difference ? Why ?
Snakemake can have individual files as a target. The following rule:
```{python}
rule bamstats:
input:
"hisat2/{sample}.sorted.bam"
output:
stats_txt = "QC/{sample}/{sample}.stats",
stats_html = "QC/{sample}/{sample}.plots.html"
params:
"QC/{sample}/{sample}.plots"
conda:
"envs/core.yaml"
shell:
"samtools stats {input} > {output.stats_txt} && plot-bamstats -p {params} {output.stats_txt}"
```
Was not included as part of our workflow, as it was not required to run featureCounts. To run this rule for a particular file, you have target one of the output that is generated by this rule for a particular file using the following formula:
`snakemake --use-conda QC/SAMPLE/SAMPLE.plots.html`
Where SAMPLE can be any of the accession codes corresponding to the samples stored inside the `FASTQ` folder.
**Exercise 5.1.3**
Can you run this rule for the largest and smallest sample? (hint: use `ls -lh` to check file size) What useful information can be found on the output html file? Are there any evident difference between the results for the largest and smaller sample?