-
Notifications
You must be signed in to change notification settings - Fork 5
/
Copy pathindex.Rmd
437 lines (330 loc) · 17.4 KB
/
index.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
---
title: "Tutorial of RNA-seq tumor immunity analysis"
author: "Yang Liu, Jennifer Altreuter, Yang Lin"
date: "`r Sys.Date()`"
site: bookdown::bookdown_site
output: bookdown::gitbook
#documentclass: book
#bibliography: [book.bib, packages.bib]
#biblio-style: apalike
link-citations: yes
description: "This is a minimal example of using the bookdown package to write a book. The output format for this example is bookdown::gitbook."
---
# Introduction {#intro}
This tutorial describes how to perform integrative computational analysis of **tumor immunity** using bulk RNA-sequencing (RNA-seq) data. We will focus on inferring immune infiltration levels, immune repertoire features, immune response and HLA type from a gene expression profile.\
To support the uniform analysis of bulk RNA-sequencing data, we developed a RNA-seq immune analysis pipeline named **RIMA** that is available at https://github.com/liulab-dfci/RIMA_pipeline. \
Tumor RNA-seq has become an important technique for molecular profiling and
immune characterization of tumors. RNA-seq Immune Analysis (RIMA) performs
integrative computational modeling of the tumor microenvironment from bulk tumor
RNA-seq data, which has the potential to offer essential insights to cancer
immunology and immune-oncology studies.\
```{r fig.align='center', echo=FALSE, fig.cap='Flowchat of RIMA pipeline'}
knitr::include_graphics('images/RIMA.png', dpi = NA)
```
The pre-processing module includes four main procedures:\
<ul>
<li>**Read mapping**</li>
<li>**Quality control**</li>
<li>**Gene quantification**</li>
<li>**Batch effect removal**</li>
</ul>
The downstream analysis includes seven modules:
<ul>
<li>**Differential expression analysis**</li>
<li>**Immune repertoire inference**</li>
<li>**Immune infiltration estimation**</li>
<li>**Immunotherapy response prediction**</li>
<li>**Gene fusion**</li>
<li>**Microbiome characterization**</li>
<li>**Neoantigen detection**</li>
</ul>
**Available Tools Checklist**
| **Methods** | **Description** |
| :---: | :---: | :---: |
| | **---PREPROCESSING---**|
| STAR | Spliced Transcript Alignment to a Reference |
| Salmon | Gene Quantification |
| RSeQC | High Throughput Sequence Data Evaluation |
| batch_removal| Remove Batch Effects Using Limma |
| | **---DIFFERENTIAL EXPRESSION---** |
| DESeq2 | Gene Differential Expression Analysis |
| GSEA | Gene Set Enrichment Analysis |
| ssGSEA | Single-sample GSEA |
| | **---IMMUNE REPERTOIRE---** |
| TRUST4 | TCR and BCR Sequence Analysis |
| | **---IMMUNE INFILTRATION---** |
| ImmuneDeconv | Cell Components Estimation |
| | **---IMMUNE RESPONSE---** |
| MSIsensor2 | Microsatellite Instability (MSI) Detection |
| TIDEpy | T cell dysfunction and exclusion prediction |
| | **---FUSION---** |
| STAR-Fusion | Identify the fusion gene pairs |
| | **---MICROBIOME---** |
| Centrifuge | Bacterial Abundance Detection |
| | **---NEO-ANTIGEN---** | |
| arcasHLA | HLA Class I and Class II Genotyping |
# How to run RIMA
<h2>***“Please make sure you adhere to NIH and HIPAA security standards when uploading controlled access data to a cloud or high computing environment.”***</h2>
## Install RIMA and Set Up the Running Environment ##
To run RIMA, you will create a working directory. In the working directory, you will install the pipeline code and other required software.
**Running machine requirements**
<ul>
<li> Cores and memory -- We usually run RIMA on a machine with at least 64 GB of memory to ensure a smooth read alignment run using STAR. </li>
<li>Storage -- You will need to reserve enough storage for the following: \
1. Reference files and pipeline code -- 65 GB \
2. Raw data -- we recommend reserving five times the size of your data files. For example, if your fastq files are 50GB each, then you should allot 250GB for each fastq file. Zipped fastq files are recommended. \
3. Required software -- reserve ~20GB for installation of miniconda, mamba and snakemake.
</li>
</ul>
**Prepare a working directory**
```
mkdir RIMA
```
**Get the pipeline code**
Download the RIMA_pipeline folder to your working directory.
```
git clone https://github.com/liulab-dfci/RIMA_pipeline.git
```
**Install miniconda**
Download and configure your conda environment. Follow the prompts on the installer screens.\
#Note: You will need to install miniconda3 into the RIMA directory. This means you need to override the default directory when miniconda3 is installed.\
#If you have conda installed elsewhere, you will still need a local copy in your RIMA directory in order for the environment installation shell program to work. (see below)\
```
#go to your RIMA directory if you are not there already
cd RIMA
#obtain the Miniconda3 code and install in the RIMA directory
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh
# !Don't forget to override the default directory for miniconda3!
# Install miniconda3 into your RIMA directory.
conda list #to see whether you have successfully installed conda
```
**Set up the running environment**
The RIMA environment can be set up by running a shell command. In the command, you need to specify the platform you are using -- Amazon Web Services (AWS) or Google Cloud Platform (GCP). This process may take ten to fifteen minutes. The user will need to answer two questions during the process -- one at the beginning of the program and one near the end -- to give permissions to install miniconda and snakemake.
#NOTE: before running the environment shell command, make sure to set the path for CONDA_ROOT
```
export CONDA_ROOT=/{path_to_your_RIMA_directory}/RIMA/miniconda3
export PATH=/{path_to_your_RIMA_directory}/RIMA/miniconda3/bin:$PATH
```
Run the shell program:
```
#shell command for
bash ./RIMA_environment.sh -p {platform --AWS, GCP}
#example for AWS
bash ./RIMA_environment.sh -p AWS
```
**Activate the RIMA environment**
Based on the conda environments information, the main environment "RIMA" can be activated as below.
```
#set the path if you this is a new login to your shell
export CONDA_ROOT=/{path_to_your_RIMA_directory}/RIMA/miniconda3
export PATH=/{path_to_your_RIMA_directory}/RIMA/miniconda3/bin:$PATH
# activate the RIMA conda environment
source activate RIMA
```
## Download pre-built references
A pre-prepared RIMA reference folder can be downloaded using the code below.
If you want to prepare a customized reference, you can follow [this tutorial](https://liulab-dfci.github.io/RIMA/customize-your-own-reference.html) to build your own reference.
The following link contains the **hg38** reference downloaded from <a href = 'https://gdc.cancer.gov/about-data/gdc-data-processing/gdc-reference-files'>Genomic Data Commons</a> using version 22 index and annotation files.
```
cd RIMA
wget http://cistrome.org/~lyang/ref.tar.gz
# unzip the reference
tar -zxvf ref.tar.gz
# remove the reference zip file to save some space (optional)
rm ref.tar.gz
```
**Optional step**
If you download your RIMA reference folder into a directory other than RIMA, you have to change the symbolic link for the ref_files under the rnaseq_pipeline folder:
For example: if you downloaded the RIMA reference folder into /mnt/tutorial/ref_files, use the ln command to change the symbolic link.
```
cd RIMA
# remove the current link of ref_files
rm ref_files
# create a new symoblic link to your reference folder
ln -s /mnt/tutorial/ref_files
```
### Video tutorial
<iframe width="560" height="315" src="https://www.youtube.com/embed/XjggT6c0CW4" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
## Prepare input files
**Metasheet.csv**
Metasheet.csv is a comma-delimited file that resides in the RIMA_pipeline folder. The Metasheet.csv should contain phenotypic information about your samples that can be used for downstream analysis. Your metasheet should contain **Two Required Columns** (SampleName, Group) in comma-delimited format.
For this tutorial, RIMA uses demo data from Zhao, J., Chen, A.X., Gartrell, R.D. et al. Immune and genomic correlates of response to anti-PD-1 immunotherapy in glioblastoma. Nat Med 25, 462–469 (2019). <a href= 'https://www.nature.com/articles/s41591-019-0349-y'>*https://doi.org/10.1038/s41591-019-0349-y* </a>. We selected 12 samples from this trial and included 6 treatment naive samples (3 responder and 3 non-responders) for cohort level comparison.
<ul>
<li>**SampleName:** Sequencing sample id.</li>
<li>**PatName**: Patient id</li>
<li>**Group**: Sample included for cohort level comparison. 'NA' indicates samples that should not be included in the comparison. In the example below, we compare responders(R) and non-responders(NR) using only treatment naive (pre) patients. </li>
</ul>
You can add columns to the metasheet in order to compare other phenotypes e.g. columns for Timing Age, Sex etc.
```
SampleName,PatName,Group,Age,Tissue,Gender,Timing,Responder,TumorLocation,OngoingTreatment,PFS,Survival,OS,SampleId,syn_batch
SRR8281218,P20,NR,63,Brain,male,Pre,NR,temporal,no,110,1,278,4790-NL-AS,1
SRR8281219,P20,NA,63,Brain,male,Post,NR,temporal,no,110,1,278,4975-NL-AS,1
SRR8281226,P53,NR,70,Brain,male,Pre,NR,frontal,Yes,83,1,337,3981-NL-AS,2
SRR8281236,P53,NA,70,Brain,male,Post,NR,frontal,Yes,83,1,337,4760-D1,2
SRR8281233,P56,NR,54,Brain,female,Pre,NR,Parietal,no,83,1,83,4341-D3,3
SRR8281230,P56,NA,54,Brain,female,Post,NR,Parietal,no,83,1,83,4680-A1,3
SRR8281244,P100,NA,31,Brain,male,Post,R,Frontal-crossesmidline,yes,151,0,615,4956-NL-AS,2
SRR8281245,P100,R,31,Brain,male,Pre,R,Frontal-crossesmidline,yes,151,0,615,4595-G1,2
SRR8281243,P101,R,32,Brain,male,Pre,R,frontal,no,0,1,414,3542-NL-AS,2
SRR8281238,P102,NA,64,Brain,female,Post,R,Temporal-anterior,yes,519,0,539,5094-NL-AS,1
SRR8281251,P101,NA,32,Brain,male,Post,R,frontal,no,0,1,414,4943-A1,2
SRR8281250,P102,R,64,Brain,female,Pre,R,Temporal-anterior,yes,519,0,539,4443-NL-AS,1
```
**Config.yaml**
In the RIMA_pipeline folder, we have prepared a config.yaml as a template to run the pipeline. The config file is divided into three sections:
<ul>
<li> Fixed and user-defined parameters. </li>
<li> Cohort level analysis parameters. </li>
<li> Samples list. </li>
</ul>
You should provide these parameters with column names that match the columns in **metasheet.csv**.
Below is an example of a config.yaml file. You can set the patient name as the sample name. Note: Please make sure that sample names match the metasheet. Currently, only fastq files are accepted as input (including fastq.gz).
```
#########Fixed and user-defined parameters################
metasheet: metasheet.csv # Meta info
ref: ref.yaml # Reference config
assembly: hg38
cancer_type: GBM #TCGA cancer type abbreviations
rseqc_ref: house_keeping #Option: 'house_keeping' or 'false'.
#By default, a subset of housekeeping genes is used by RSeQC to assess alignment quality.
#This reduces the amount of time needed to run RSeQC.
mate: [1,2] #paired-end fastq format, we recommend naming paired-end reads with _1.fq.gz and _2.fq.gz
#########Cohort level analysis parameters################
design: Group # Condition on which to do comparsion (as set up in metasheet.csv)
Treatment: R # Treatment use in DESeq2, corresponding to positive log fold change
Control: NR # Control use in DESeq2, corresponding to negative log fold change
batch: syb_batch # Options: 'false' or a column name from the metasheet.csv.
# If set to a column name in the metasheet.csv, the column name will be used for batch effect analysis (limma).
# It will also be used as a covariate for differential analysis (DESeq2) to account for batch effect.
pre_treated: false # Option: true or false.
# If set to false, patients are treatment naive.
# If set to true, patients have received some form of therapy prior to the current study.
#########Samples list##############
samples:
SRR8281218:
- /mnt/zhao_trial/data/SRR8281218_1.fastq.gz
- /mnt/zhao_trial/data/SRR8281218_2.fastq.gz
SRR8281219:
- /mnt/zhao_trial/data/SRR8281219_1.fastq.gz
- /mnt/zhao_trial/data/SRR8281219_2.fastq.gz
SRR8281226:
- /mnt/zhao_trial/data/SRR8281226_1.fastq.gz
- /mnt/zhao_trial/data/SRR8281226_2.fastq.gz
SRR8281236:
- /mnt/zhao_trial/data/SRR8281236_1.fastq.gz
- /mnt/zhao_trial/data/SRR8281236_2.fastq.gz
SRR8281230:
- /mnt/zhao_trial/data/SRR8281230_1.fastq.gz
- /mnt/zhao_trial/data/SRR8281230_2.fastq.gz
SRR8281233:
- /mnt/zhao_trial/data/SRR8281233_1.fastq.gz
- /mnt/zhao_trial/data/SRR8281233_2.fastq.gz
SRR8281244:
- /mnt/zhao_trial/data/SRR8281244_1.fastq.gz
- /mnt/zhao_trial/data/SRR8281244_2.fastq.gz
SRR8281245:
- /mnt/zhao_trial/data/SRR8281245_1.fastq.gz
- /mnt/zhao_trial/data/SRR8281245_2.fastq.gz
SRR8281243:
- /mnt/zhao_trial/data/SRR8281243_1.fastq.gz
- /mnt/zhao_trial/data/SRR8281243_2.fastq.gz
SRR8281251:
- /mnt/zhao_trial/data/SRR8281251_1.fastq.gz
- /mnt/zhao_trial/data/SRR8281251_2.fastq.gz
SRR8281238:
- /mnt/zhao_trial/data/SRR8281238_1.fastq.gz
- /mnt/zhao_trial/data/SRR8281238_2.fastq.gz
SRR8281250:
- /mnt/zhao_trial/data/SRR8281250_1.fastq.gz
- /mnt/zhao_trial/data/SRR8281250_2.fastq.gz
```
**execution.yaml**
Use **execution.yaml** to control which modules to run in RIMA. The preprocess module outputs are required for optional downstream analysis modules. Details of each module will be introduced in the next chapters.
```
##Note: The preprocess individual and cohort modules are necessary to obtain alignment and quality results.
##Run the remaining modules only after these two modules.
preprocess_individual: true
preprocess_cohort: true
##Optional modules
##Note: The below modules are specialized modules, each dealing with specific targets.
##Make sure to run both the individual and cohort parts of each module
##to get all results.
##Individual runs
mutation_individual: false
immune_response_individual: false
immune_repertoire_individual: false
microbiome_individual: false
neoantigen_individual: false
##Cohort runs
differential_expression_cohort: false
immune_infiltration_cohort: false
mutation_cohort: false
immune_response_cohort: false
immune_repertoire_cohort: false
microbiome_cohort: false
neoantigen_cohort: false
```
**ref.yaml**
The ref.yaml file provides the paths for all reference files. If you downloaded the pre-built reference files, you should not need to change the ref.yaml file.
```
#REF.yaml - File to contain paths to static reference files that RNA-Seq requires
# for its analysis.
# NOTE: these are already pre-filled for hg38
# NOTE: organized by assemblies
#
# !VERY IMPORTANT: you can OVERRIDE any of these values in config.yaml!
hg38:
###annotation path
fasta_path: ./ref_files/GRCh38.d1.vd1.fa
gtf_path: ./ref_files/gencode.annotation.gtf
bed_path: ./ref_files/RSeQC_bed/refseqGenes.bed
housekeeping_bed_path: ./ref_files/RSeQC_bed/housekeeping_refseqGenes.bed
trust4_reper_path: ./ref_files/TRUST4/hg38_bcrtcr.fa
trust4_IMGT_path: ./ref_files/TRUST4/human_IMGT+C.fa
###index path
star_index: ./ref_files/star_index
star_fusion_index: ./ref_files/fusion_index/ctat_genome_lib_build_dir
salmon_index: ./ref_files/salmon_index/transcripts.fa
msisensor_index: ./ref_files/msisensor_index/hg38/microsatellite.list
centrifuge_index: ./ref_files/centrifuge_index/p_compressed+h+v
annotation_pyprada: ./ref_files/pyPRADA/gencode.canonical.gene.exons.tab.txt
###tool path
msisensor2_path: ./ref_files/msisensor2
trust4_path: ./ref_files/TRUST4
arcasHLA_path: ./ref_files/arcasHLA
prada_path: ./ref_files/pyPRADA/pyPRADA_1.2
```
### Video tutorial
<iframe width="560" height="315" src="https://www.youtube.com/embed/o5QA51nZVtI" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
## Running RIMA
**Check the path of your conda env**
```
conda env list
# conda environments:
#
base * /home/ubuntu/miniconda3
centrifuge_env /home/ubuntu/miniconda3/envs/centrifuge_env
fusion_env /home/ubuntu/miniconda3/envs/fusion_env
rnaseq /home/ubuntu/miniconda3/envs/rnaseq
rseqc_env /home/ubuntu/miniconda3/envs/rseqc_env
stat_perl_r /home/ubuntu/miniconda3/envs/stat_perl_r
```
**Check the pipeline with a dry run to ensure correct script and data usage.**
```
snakemake -s RIMA.snakefile -np
```
**Submit the job.**
Alignment and some of the other modules of RIMA will take several hours to run. It is recommended that you run RIMA in the background using a command such as nohup as below.
```
nohup time snakemake -p -s RIMA.snakefile -j 4 > RIMA.out &
```
note: Argument -j sets the cores for parallel runs (e.g. '-j 4' can run 4 jobs in parallel at the same time.). Argument -p prints the command in each rule. Note: Here, output log records the run information. A user may run one module at a time to obtain a record of each module's output log.
## Output
Output folders are generated in a folder called "analysis". The output structure of each module will be introduced in the next chapters.
```{r, include=FALSE}
# automatically create a bib database for R packages
knitr::write_bib(c(
.packages(), 'bookdown', 'knitr', 'rmarkdown'
), 'packages.bib')
```