###############################################################################
Copyright (c) 2016, David W Morgens
All rights reserved.
Redistribution and use in source and binary forms, with or without modification, are permitted
provided that the following conditions are met:
1. Redistributions of source code must retain the above copyright notice, this list of conditions
and the following disclaimer.
2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions
and the following disclaimer in the documentation and/or other materials provided with the distribution.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED
WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT
NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS;
OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF
THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
###############################################################################
Analyzing screen data using casTLE.
David Morgens
05/23/2016
Version 1.0
Version 0.7 used in Morgens et al (2016) "Systematic comparison of CRISPR/Cas9
and RNAi screens for essential genes." Nat Biotechnol advance on. Implementation
has changed but results should be identical. Version 0.7 scripts available
from Scripts/Scripts0.7
If you use these scripts please cite:
Morgens DW, Deans RM, Li A, Bassik MC (2016) Systematic comparison of
CRISPR/Cas9 and RNAi screens for essential genes. Nat. Biotechnol. advance on
###############################################################################
# Requirements
###############################################################################
Scripts are written for Python 2.7, though they should be compatible with Python 3.
makeCounts.py and makeIndices.py require bowtie to be installed.
Required modules: numpy, math, csv, collections, os, scipy, sys, random, re,
argparse, subprocess, shlex, time, matplotlib, warnings
Optional module: pp (Parallel python), which will allow for parallel processing,
greatly increasing the speed of computation.
###############################################################################
# Installation
###############################################################################
Option 1 for repository functionality:
Download and install mercurial from https://www.mercurial-scm.org/
Run command:
hg clone https://bitbucket.org/dmorgens/castle
Option 2 for source file download:
Go to https://bitbucket.org/dmorgens/castle/downloads
Select "Download repository".
Extract folder.
Python 2.7 and modules available from https://www.continuum.io/downloads
Parallel Python module available from http://www.parallelpython.com/
Bowtie available at http://bowtie-bio.sourceforge.net/index.shtml
###############################################################################
# Quick use:
###############################################################################
Use -h to view documentation.
Make bowtie index:
python Scripts/makeIndices.py <oligo file> <screen type> <output bowtie file>
Align sequence files:
python Scripts/makeCounts.py <file base for fastq files> <output file> <screen type>
Compare two count files with casTLE:
python Scripts/analyzeCounts.py <count file 1> <count file 2> <output results file>
Calculate p-values for casTLE result file:
python Scripts/addPermutations.py <results file> <number of permutations>
Combine multiple casTLE result files:
python Scripts/analyzeCombo.py <results file 1> <results file 2> <output file>
Calculate p-values for combination of multiple casTLE results:
python Scripts/addCombo.py <combo file> <number of permutations>
Plot distribution of elements from count file:
python Scripts/plotDist.py <output name> <count file 1> <count file 2> ...
Plot casTLE result file:
python Scripts/plotVolcano.py <results file>
Plot individual gene results from casTLE result file:
python Scripts/plotGenes.py <results file> <gene name 1> <gene name 2> ...
Compare enrichment of individual elements between multiple result files:
python Scripts/plotElements.py <results file 1> <results file 2> <output name>
Compare effect size and confidence between multiple result files:
python Scripts/plotRep.py <results file 1> <results file 2> <output name>
###############################################################################
# Overview of directory system
###############################################################################
For the sake of reproducibility and organization, the casTLE scripts work
only within the provided directory structure. This can generally be overridden
using -or, which waives the requirement for record files, and -of, which waives
automatic placement of files in certain folders.
Many of the scripts create a record file, *_record.txt, which records the
parameters that were used and provides downstream scripts with necessary
information. Because of this dependence on record files, most scripts will
not work if a file is moved out of the directory system.
Directory is as follows:
Data - contains count files
GenRef - contains gene descriptions, GO terms, and symbol/ID conversion info
Indices - contains bowtie indexes for alignments
Records - contains permanent record files for reproducibility's sake.
Results - contains result and combo files
Scripts - contains casTLE scripts.
All analyses should be run from the top folder.
python Scripts/*.py
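As a convenience, a short check like the following can confirm that the
expected folders are present before running a script. This is only a sketch
and is not part of casTLE: the function name missing_folders and the constant
EXPECTED_FOLDERS are made up for illustration, and the folder names are taken
from the list above.

```python
import os

# Folder names copied from the directory overview above.
EXPECTED_FOLDERS = ["Data", "GenRef", "Indices", "Records", "Results", "Scripts"]


def missing_folders(top_folder="."):
    """Return the expected folders that do not exist under top_folder."""
    return [name for name in EXPECTED_FOLDERS
            if not os.path.isdir(os.path.join(top_folder, name))]


if __name__ == "__main__":
    missing = missing_folders()
    if missing:
        print("Missing folders: " + ", ".join(missing))
    else:
        print("Directory system looks complete.")
```

When run from the top folder of an intact directory system, missing_folders()
should return an empty list.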
###############################################################################
# Overview of file types and formats
###############################################################################
# Count file
Description: The counts of elements in a single sample.
Naming scheme: <name>_counts.csv
Location: Stored in Data folder
Format: Tab or comma delimited. First column element name, second column counts
<element_name>,<count for element>
Record: <name>_record.txt
Example:
0None_none_ACOC_204550.4501,252
0None_none_ACOC_204552.4503,239
0None_none_ACOC_204555.4506,105
0None_none_ACOC_204562.4513,180
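A count file in this format can be loaded with a few lines of Python. The
sketch below is not taken from the casTLE scripts; read_counts is a
hypothetical helper, and it assumes Python 3, one element per line, no header
row, and integer counts as in the example.

```python
import io


def read_counts(handle):
    """Return {element name: count} from an open count file.

    Assumes tab or comma delimited lines with the element name in the
    first column and an integer count in the second column.
    """
    counts = {}
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        delimiter = "\t" if "\t" in line else ","
        fields = line.split(delimiter)
        counts[fields[0]] = int(fields[1])
    return counts


# Two lines taken from the example above.
example = io.StringIO(
    "0None_none_ACOC_204550.4501,252\n"
    "0None_none_ACOC_204552.4503,239\n"
)
example_counts = read_counts(example)
print(example_counts["0None_none_ACOC_204550.4501"])  # prints 252
```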
Each element name must contain the target gene ID as the first part:
<geneID>_<Element ID>
The format for element names allows downstream programs to understand gene
targets and identify negative controls. The first part of the element name
is used as the GeneID, and negative controls are indicated by the starting string.
If the symbol denoting negative controls was '0', then both
0None_none_ACOC_204550.4501
0Safe_safe_ACOC_204123.3245
would be considered negative controls.
For display of individual element enrichments, the last part of the element
name is used as an element ID. Note this does not have to be unique between
genes.
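The naming rules above can be sketched in Python as follows. This is an
illustration only, not the parser used by the casTLE scripts: the function
name parse_element_name is hypothetical, and the sketch assumes that the parts
of a name are separated by underscores, as in the examples.

```python
def parse_element_name(element_name, negative_symbol="0"):
    """Split an element name into (gene ID, element ID, is negative control).

    Assumes underscore-separated parts: the first part is the gene ID and
    the last part is the element ID. An element is treated as a negative
    control when its name starts with negative_symbol.
    """
    parts = element_name.split("_")
    gene_id = parts[0]
    element_id = parts[-1]
    is_negative = element_name.startswith(negative_symbol)
    return gene_id, element_id, is_negative


print(parse_element_name("0Safe_safe_ACOC_204123.3245"))
# prints ('0Safe', '204123.3245', True)
```

With the negative control symbol '0', both example names above are flagged as
negative controls, while a name whose gene ID does not start with '0' is not.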
###############################################################################
# Result file
Description: The output of casTLE from the comparison of two count files
Naming scheme: <name>.csv
Location: Stored in Results folder
Format: Comma delimited.
Record: <name>_record.txt
Each row represents a single gene.
"GeneID", "Symbol", and "GeneInfo" identify the gene
"Localization", "Process", and "Function" display the GO terms for that gene
"Element #" indicates the number of elements targeting that gene found
"casTLE Effect" indicates the most likely effect size as determined by casTLE
"casTLE Score" indicates the confidence in that effect size, where larger is
more confident.
"casTLE p-value" is the estimated p-value from the casTLE Score. If 'N/A',
permutations need to be run to estimate.
"Minimum Effect Estimate" and "Maximum Effect Estimate" indicate the 95% credible
interval for the casTLE effect size estimate
"Individual Elements" indicates the individual element enrichments. These
are formatted as: <enrichment value> : <element ID>
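Because the result file is comma delimited, it can be filtered with Python's
csv module. The sketch below is not part of casTLE; genes_below_p is a
hypothetical helper, and it assumes Python 3 and that the file begins with a
header row using exactly the column names listed above. Rows whose p-value is
'N/A' are skipped, since permutations have not yet been run for them.

```python
import csv
import io


def genes_below_p(handle, alpha=0.05):
    """Return (GeneID, casTLE Effect, p-value) for rows with p-value <= alpha."""
    hits = []
    for row in csv.DictReader(handle):
        p_text = row["casTLE p-value"].strip()
        if p_text == "N/A":
            continue  # p-value not yet estimated by permutation
        p_value = float(p_text)
        if p_value <= alpha:
            hits.append((row["GeneID"], float(row["casTLE Effect"]), p_value))
    return hits


# Made-up rows for illustration only; these are not real screen results.
demo = io.StringIO(
    "GeneID,casTLE Effect,casTLE Score,casTLE p-value\n"
    "G1,-2.5,40.0,0.001\n"
    "G2,0.1,1.0,N/A\n"
    "G3,0.3,2.0,0.6\n"
)
demo_hits = genes_below_p(demo)
print(demo_hits)  # prints [('G1', -2.5, 0.001)]
```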
###############################################################################
# Combo file
Description: The output of casTLE from the combination of two result files
Naming scheme: <name>.csv
Location: Stored in Results folder
Format: Comma delimited.
Record: <name>_record.txt
Each row represents a single gene
"GeneID", "Symbol", and "GeneInfo" identify the gene
"Localization", "Process", and "Function" display the GO terms for that gene
"Element # 1" and "Element # 2" indicate the number of elements targeting that gene
found in each result file.
"casTLE Effect 1" and "casTLE Effect 2" indicate the most likely effect size
as determined by casTLE from each individual result.
"casTLE Score 1" and "casTLE Score 2" indicate the confidence in that effect
size, where larger is more confident, from each individual result.
"Combo casTLE Effect" indicates the most likely effect size as determined by
casTLE from the combination of both results.
"Combo casTLE Score" indicates the confidence in that effect size, where
larger is more confident, from the combination of both results.
"Combo casTLE p-value" is the estimated p-value from the Combo casTLE score.
If "N/A", then run addCombo.py to calculate p-values.
"Minimum Effect Estimate" and "Maximum Effect Estimate" indicate the 95% credible
interval for the Combo casTLE effect size estimate
###############################################################################
# Overview of procedures
###############################################################################
#
Overview of alignment and count file creation
Overview of making result files from count files
Overview of combination analysis
###############################################################################
# Overview of alignment and count file creation
1) Create a bowtie index using makeIndices.py. This will also let you name your
screen type for future reference. To do this you need an oligo file containing
the name and sequence of each element in your screen. This is a comma-delimited
file with two columns: <element name>,<element sequence>. See count file
overview for element naming schemes.
2) Use makeCounts.py to create a count file from your fastq file. Each count
file will correspond to a single condition.
References to each bowtie index are stored in Indices/screen_type_index.txt
in tab-delimited form. This allows both a quick reference name and a full
descriptive name.
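The oligo file described in step 1 can be written with Python's csv module.
This is a minimal sketch, assuming Python 3; write_oligo_file is a
hypothetical helper that is not part of casTLE, and the sequence shown is a
placeholder rather than a real library element.

```python
import csv
import io


def write_oligo_file(handle, elements):
    """Write (element name, sequence) pairs as two comma-delimited columns."""
    writer = csv.writer(handle, lineterminator="\n")
    for name, sequence in elements:
        # Element names are expected to look like <geneID>_<Element ID>.
        if "_" not in name:
            raise ValueError("expected <geneID>_<Element ID>, got %r" % name)
        writer.writerow([name, sequence])


oligo_buffer = io.StringIO()
write_oligo_file(oligo_buffer, [
    ("0None_none_ACOC_204550.4501", "ACGTACGTACGTACGTACGT"),  # placeholder sequence
])
print(oligo_buffer.getvalue())
```

To write to disk instead, pass a file opened with open(path, "w", newline="")
in place of the in-memory buffer.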
###############################################################################
# Overview of making result files from count files
1) Indicate two count files to compare using:
python Scripts/analyzeCounts.py <count file 1> <count file 2> <output results file>
python Scripts/analyzeCounts.py Data/untreated_counts.csv Data/treated_counts.csv name
A result file will be created at Results/<output results file>.
If using count files not created by makeCounts.py, you will need to override
the search for a record file. To do this you will also need to indicate a
negative control symbol or select a non-negative control background with -b:
python Scripts/analyzeCounts.py <count file 1> <count file 2> <output results file>
-n <negative symbol>
-ro
or
python Scripts/analyzeCounts.py <count file 1> <count file 2> <output results file>
-b tar
-ro
2) In order to calculate p-values from the casTLE score distribution, the null
distribution needs to be estimated by permutation. To do this, run
python Scripts/addPermutations.py <result file> <number of permutations>
python Scripts/addPermutations.py Results/Test/result.csv 10000
This command will create an auxiliary ref file and replace your result file
with one containing estimated p-values. Note that the p-value is an estimate
and will become more accurate (and potentially more significant) with more
permutations. To add more permutations to a file, simply call addPermutations.py
again. It will add to the existing number of permutations.
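To see why more permutations can make a p-value both more accurate and
smaller, consider a generic permutation estimate. The sketch below is not
taken from addPermutations.py and may differ from what casTLE does; it uses
the common add-one estimator, and the function name and scores are
illustrative only.

```python
def empirical_p_value(observed_score, null_scores):
    """Estimate a p-value as the fraction of null scores >= the observed score.

    One is added to numerator and denominator so that the estimate is never
    exactly zero; the smallest possible value is 1 / (N + 1) for N permutations.
    """
    exceed = sum(1 for score in null_scores if score >= observed_score)
    return (exceed + 1.0) / (len(null_scores) + 1.0)


# A score larger than every permuted score is bounded by the permutation count.
print(empirical_p_value(10.0, [1.0] * 9))   # prints 0.1
print(empirical_p_value(10.0, [1.0] * 99))  # prints 0.01
```

Under this estimator a strong hit cannot receive a p-value below 1 / (N + 1),
so adding permutations lowers that floor as well as reducing sampling noise.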
###############################################################################
# Overview of combination analysis
1) casTLE can also combine results from multiple screens. This can either be
combining two replicates of the same screen, or combining two disparate screening
types, such as shRNA and CRISPR/Cas9. Use analyzeCombo.py to indicate two result
files:
python Scripts/analyzeCombo.py <result file 1> <result file 2> <output file>
This will create a combo file at Results/<output file>.
2) To estimate p-values for combination results, use addCombo.py
python Scripts/addCombo.py <combo file> <number of permutations>
###############################################################################
# Available plotting scripts
###############################################################################
In order to visualize your data, the available plotting scripts are:
plotDist.py
plotVolcano.py
plotGenes.py
plotElements.py
plotRep.py
###############################################################################
# plotDist.py
Plot distribution of elements from count file:
python Scripts/plotDist.py <output name> <count file 1> <count file 2> ...
This will create an image at Results/<output name>
This will allow the visualization of the diversity in the given count files.
This can be used to identify bottlenecked samples or see the effect of
selection on the diversity. Diversity scores are calculated as normalized
entropy using the total number of elements to define the max entropy.
Any number of count files can be used. Use -l to label samples.
Example:
python Scripts/plotDist.py test_dist Data/plasmid_counts.csv Data/untreated_counts.csv Data/treated_counts.csv -l Plasmid Untreated Treated
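The diversity score described above can be illustrated with a short
calculation. This is a sketch of normalized entropy under stated assumptions,
not the code in plotDist.py: the function name is hypothetical, natural
logarithms are used (the base cancels in the ratio), and elements with zero
counts contribute nothing to the entropy but still count toward the total
number of elements.

```python
import math


def normalized_entropy(counts):
    """Return Shannon entropy of the counts divided by log(number of elements).

    A perfectly even sample scores 1.0; a sample in which a single element
    holds every read scores 0.0.
    """
    total = float(sum(counts))
    n_elements = len(counts)
    if total == 0 or n_elements < 2:
        return 0.0  # entropy is undefined or trivially zero
    entropy = 0.0
    for count in counts:
        if count > 0:
            fraction = count / total
            entropy -= fraction * math.log(fraction)
    return entropy / math.log(n_elements)


print(normalized_entropy([40, 0, 0, 0]))  # prints 0.0
```

A bottlenecked sample, in which a few elements dominate the reads, would score
well below an evenly represented one.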
###############################################################################
# plotVolcano.py
Creates a 2D histogram of effect by significance
python Scripts/plotVolcano.py <results file>
Visualizes the results from a single result file. Individual points can be labeled
with -n. Creates image at <results file>_volcano.*
Image type can be changed with -f.
Example:
python Scripts/plotVolcano.py Results/results.csv -n MTOR MYC -f eps
###############################################################################
# plotGenes.py
Plot individual gene results from casTLE result file:
python Scripts/plotGenes.py <results file> <gene name 1> <gene name 2> ...
Create cloud and hist plots for individual genes from a given result file.
Cloud plots show data as raw counts with negative controls as reference.
Hist plots show density plot for targeting elements and negative controls.
Created files are placed at <results file>_<gene name>_*
Example:
python Scripts/plotGenes.py Results/results.csv MTOR MYC
###############################################################################
# plotElements.py
Compare enrichment of individual elements between multiple result files:
python Scripts/plotElements.py <results file 1> <results file 2> <output name>
Example:
python Scripts/plotElements.py Results/replicate1.csv Results/replicate2.csv rep_rep1_rep2
###############################################################################
# plotRep.py
Compare effect size and confidence between multiple result files:
python Scripts/plotRep.py <results file 1> <results file 2> <output name>
This allows you to compare two result files, visualizing reproducibility or
lack thereof.
Example:
python Scripts/plotRep.py Results/replicate1.csv Results/replicate2.csv rep_rep1_rep2
###############################################################################