-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathSeed_sequence_processing.Rmd
executable file
·351 lines (249 loc) · 8.85 KB
/
Seed_sequence_processing.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
---
title: "Argonne_Seed_seq_processing_June2023"
author: "Abby Sulesky-Grieb"
date: "2023-06-20"
output:
pdf_document: default
html_document: default
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, eval = FALSE)
```
# Code used on the MSU HPCC for QIIME2 workflow
## Copy demultiplexed sequences into working space from raw_sequence directory
```{r}
#!/bin/bash -login
#SBATCH --time=03:59:00
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --mem=30G
#SBATCH --job-name copy_sequences
#SBATCH [email protected]
#SBATCH --mail-type=BEGIN,END
######## Job code
cp *.fastq /mnt/research/ShadeLab/WorkingSpace/Sulesky/New_Argonne_Seed_Sequencing/input_data
echo -e "\n `sacct -u suleskya -j $SLURM_JOB_ID --format=JobID,JobName,Start,End,Elapsed,NCPUS,ReqMem` \n"
scontrol show job $SLURM_JOB_ID
```
submit job:
sbatch copy_seqs.sb
## Rename files for Figaro to run correctly
rename using the util-linux command "rename"
use module load to load util-linux
```{r}
module spider util-linux
# choose most recent version
module spider util-linux/2.37
# follow instructions to load
module load GCCcore/11.2.0
module load util-linux/2.37
module list #check which packages are loaded
```
Copy sequencing files into figaro_input directory: /mnt/research/ShadeLab/WorkingSpace/Sulesky/New_Argonne_Seed_Sequencing/figaro/figaro_input
Use 'rename' to make files match illumina naming convention: SampleName_S1_L001_R1_001.fastq
Example:
> rename G0_ G0 *.fastq
this code takes all files that end in *fastq, replaces "G0_" with "G0"
had to run various lines to remove extra underscores in sample IDs and add L001 to all
rename goes really quickly, ~1 second per command
## Figaro for trim parameters
make figaro output folder in workingspace directory /New_Argonne_Seed_Sequencing/figaro
go to home directory where figaro is installed to run figaro:
cd /mnt/home/suleskya/figaro-master/figaro
Run figaro as a job:
nano figaro_argonne_june2023.sb
```{r}
#!/bin/bash -login
########## SBATCH Lines for Resource Request ##########
#SBATCH --time=3:59:00
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=64G
#SBATCH --job-name figaro
#SBATCH -A shade-cole-bonito
#SBATCH [email protected]
#SBATCH --mail-type=BEGIN,END
########## Command Lines for Job Running ##########
conda activate figaro
python figaro.py -i /mnt/research/ShadeLab/WorkingSpace/Sulesky/New_Argonne_Seed_Sequencing/figaro/figaro_input -o /mnt/research/ShadeLab/WorkingSpace/Sulesky/New_Argonne_Seed_Sequencing/figaro -f 1 -r 1 -a 253 -F illumina
conda deactivate
echo -e "\n `sacct -u suleskya -j $SLURM_JOB_ID --format=JobID,JobName,Start,End,Elapsed,NCPUS,ReqMem` \n"
scontrol show job $SLURM_JOB_I
```
sbatch figaro_argonne_june2023.sb
check job progress with "sq" command
Once job is finished, check the figaro output:
cd
> less trimParameters.json
{
"trimPosition": [
191,
84
],
"maxExpectedError": [
2,
2
],
"readRetentionPercent": 94.64,
"score": 92.63569499836179
},
> control Z to exit "less"
we will truncate the sequences at forward 191 and reverse 84, which will merge 94.64 percent of the reads
## Import data into Qiime2 format
Have to rename files in input-data for Qiime2 input with the same method as above
(base) -bash-4.2$ rename G0_ G0. *.fastq
(base) -bash-4.2$ rename G1_ G1. *.fastq
(base) -bash-4.2$ rename G2_ G2. *.fastq
(base) -bash-4.2$ rename Shade_ Shade *.fastq
(base) -bash-4.2$ rename _Fina Fina *.fastq
(base) -bash-4.2$ rename _R _L001_R *.fastq
have to zip the fastqs with gzip to .gz (did this in a job, it took 3 hours)
```{r}
#!/bin/bash -login
########## SBATCH Lines for Resource Request ##########
#SBATCH --time=3:59:00
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=64G
#SBATCH --job-name import_data
#SBATCH -A shade-cole-bonito
#SBATCH [email protected]
#SBATCH --mail-type=BEGIN,END
########## Command Lines for Job Running ##########
conda activate qiime2-2022.8
qiime tools import \
--type 'SampleData[PairedEndSequencesWithQuality]' \
--input-path input_data \
--input-format CasavaOneEightSingleLanePerSampleDirFmt \
--output-path demux-paired-end.qza
conda deactivate
echo -e "\n `sacct -u suleskya -j $SLURM_JOB_ID --format=JobID,JobName,Start,End,Elapsed,NCPUS,ReqMem` \n"
scontrol show job $SLURM_JOB_I
```
Successful, saved data as demux-paired-end.qza, can use this file in dada2
## Denoise and merge
```{r}
#!/bin/bash -login
########## SBATCH Lines for Resource Request ##########
#SBATCH --time=24:00:00
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=32
#SBATCH --mem=100G
#SBATCH --job-name dada2
#SBATCH -A shade-cole-bonito
#SBATCH [email protected]
#SBATCH --mail-type=BEGIN,END
########## Command Lines for Job Running ##########
conda activate qiime2-2022.8
qiime dada2 denoise-paired \
--i-demultiplexed-seqs demux-paired-end.qza \
--p-trunc-len-f 191 \
--p-trunc-len-r 84 \
--o-table table.qza \
--o-representative-sequences rep-seqs.qza \
--o-denoising-stats denoising-stats.qza
qiime metadata tabulate \
--m-input-file denoising-stats.qza \
--o-visualization denoising-stats.qzv
qiime feature-table summarize \
--i-table table.qza \
--o-visualization table.qzv \
--m-sample-metadata-file 230126_Shade_Seeds_16s_DG.txt
qiime feature-table tabulate-seqs \
--i-data rep-seqs.qza \
--o-visualization rep-seqs.qzv
conda deactivate
echo -e "\n `sacct -u suleskya -j $SLURM_JOB_ID --format=JobID,JobName,Start,End,Elapsed,NCPUS,ReqMem` \n"
scontrol show job $SLURM_JOB_I
```
job successful! took 12 hr 24 minutes
## Taxonomy assignment with silva
# Assign taxonomy using Silva
Download reference seqs from qiime2.org:
wget https://data.qiime2.org/2022.8/common/silva-138-99-515-806-nb-classifier.qza
job:
nano classify-silva-taxonomy.sb
```{bash}
#!/bin/bash -login
########## SBATCH Lines for Resource Request ##########
#SBATCH --time=08:00:00
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=32
#SBATCH --mem=64G
#SBATCH --job-name taxonomy
#SBATCH -A shade-cole-bonito
#SBATCH [email protected]
#SBATCH --mail-type=BEGIN,END
########## Command Lines for Job Running ##########
conda activate qiime2-2022.8
qiime feature-classifier classify-sklearn \
--i-classifier silva-138-99-515-806-nb-classifier.qza \
--i-reads rep-seqs.qza \
--o-classification taxonomy.qza
qiime metadata tabulate \
--m-input-file taxonomy.qza \
--o-visualization taxonomy.qzv
qiime tools export \
--input-path taxonomy.qza \
--output-path phyloseq
qiime tools export \
--input-path table.qza \
--output-path phyloseq
biom convert \
-i phyloseq/feature-table.biom \
-o phyloseq/otu_table.txt \
--to-tsv
conda deactivate
echo -e "\n `sacct -u suleskya -j $SLURM_JOB_ID --format=JobID,JobName,Start,End,Elapsed,NCPUS,ReqMem` \n"
scontrol show job $SLURM_JOB_I
```
Export OTU table to new directory phyloseq
qiime tools export \
--input-path table.qza \
--output-path phyloseq
OTU tables exports as feature-table.biom so convert to .tsv
biom convert \
-i phyloseq/feature-table.biom \
-o phyloseq/otu_table.txt \
--to-tsv
download otu table, manually change "#OTU ID" column header to "OTUID"
download taxonomy file and manually change Feature ID to OTUID in taxonomy.tsv. change taxonomy and OTU tables to csv format.
these files are now ready to export to R and run using phyloseq
# Make phylogenetic tree with de novo alignment
align reads de novo to make multiple sequence alignment (MSA)
mask alignment to reduce ambiguity
make tree with fasttree
root tree to midpoint
pipeline below will do all of these steps with default settings and save everything to output directory
export makes newick format file for the trees, or .qza can be used in iTOL
```{r}
#!/bin/bash -login
########## SBATCH Lines for Resource Request ##########
#SBATCH --time=08:00:00
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=32
#SBATCH --mem=64G
#SBATCH --job-name fasttree
#SBATCH -A shade-cole-bonito
#SBATCH [email protected]
#SBATCH --mail-type=BEGIN,END
########## Command Lines for Job Running ##########
conda activate qiime2-2022.8
qiime phylogeny align-to-tree-mafft-fasttree \
--i-sequences rep-seqs.qza \
--output-dir mafft-fasttree-output
qiime tools export \
--input-path mafft-fasttree-output/tree.qza \
--output-path mafft-fasttree-output
qiime tools export \
--input-path mafft-fasttree-output/rooted_tree.qza \
--output-path mafft-fasttree-output/exported-rooted-tree/
conda deactivate
echo -e "\n `sacct -u suleskya -j $SLURM_JOB_ID --format=JobID,JobName,Start,End,Elapsed,NCPUS,ReqMem` \n"
scontrol show job $SLURM_JOB_I
```