BulkRNASeq workflow should determine adaptor type automatically #20

J-81 · 2023-04-07T16:31:07Z

Currently, the workflow user is expected to replace this value manually in the workflow module file.
Instead, the adapter type should be determined automatically, perhaps from the raw FastQC/MultiQC reports, and supplied to the trimming process.

DPPD Reference

GeneLab_Data_Processing/RNAseq/Pipeline_GL-DPPD-7101_Versions/GL-DPPD-7101-F.md

Line 207 in 0fe1dfd

--illumina \ # if adapters are not illumina, replace with adapters used

Workflow Reference

GeneLab_Data_Processing/RNAseq/Workflow_Documentation/NF_RCP-F/workflow_code/modules/quality.nf

Lines 73 to 76 in 0fe1dfd

    
    trim_galore --gzip \
    --cores $task.cpus \
    --illumina \
    --phred33 \

J-81 · 2023-04-07T16:33:31Z

Potential route: use trim_galore's built-in adapter auto-detection:
https://github.com/FelixKrueger/TrimGalore/blob/0.6.7/Docs/Trim_Galore_User_Guide.md#adapter-auto-detection
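
For context, here is a rough sketch of the counting idea behind this kind of auto-detection. Per the linked user guide, Trim Galore scans the first 1 million sequences of the first input file for the leading 12-13 bp of the Illumina, Small RNA, and Nextera adapters and picks the most frequent one. The Python below only illustrates that idea and is not Trim Galore's actual code; `guess_adapter` and the toy reads are hypothetical, and tie-breaking and fallback behaviour are not modelled.

```python
from collections import Counter

# Leading 12-13 bp of each adapter, as listed in the Trim Galore user guide.
ADAPTERS = {
    "Illumina": "AGATCGGAAGAGC",
    "Nextera": "CTGTCTCTTATA",
    "smallRNA": "TGGAATTCTCGG",
}


def guess_adapter(reads, max_reads=1_000_000):
    """Count reads containing each adapter prefix.

    Returns a list of (adapter_name, count) sorted from most to least frequent.
    Hypothetical helper for illustration only.
    """
    counts = Counter({name: 0 for name in ADAPTERS})
    for i, read in enumerate(reads):
        if i >= max_reads:
            break
        for name, seq in ADAPTERS.items():
            if seq in read:
                counts[name] += 1
    return counts.most_common()


# Toy reads (made up): two carry the Nextera prefix, one the small RNA prefix.
reads = [
    "ACGTACGT" + "CTGTCTCTTATA" + "CACATCT",
    "GGGG" + "CTGTCTCTTATA",
    "TTTT" + "TGGAATTCTCGG",
    "ACGTACGTACGT",
]
ranking = guess_adapter(reads)
print("best hit:", ranking[0], "second best:", ranking[1])
```

The best and second-best hits correspond to what the trimming report prints in auto-detect mode ("Using ... adapter for trimming (count: ...). Second best hit was ...").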

J-81 · 2023-04-12T21:22:56Z

I'll try auto-detection by omitting the flag. I will, of course, validate that the auto-detected adapter is consistent with supplying the parameter directly.

J-81 · 2023-05-08T17:38:33Z

Test results using GLDS-426_Truncated (known to have Nextera adapters):

CURRENT (With --illumina)

Input filename: EU236_R2_raw.fastq.gz
Trimming mode: paired-end
Trim Galore version: 0.6.7
Cutadapt version: 3.7
Number of cores used for trimming: 1
Quality Phred score cutoff: 20
Quality encoding type selected: ASCII+33
Adapter sequence: 'AGATCGGAAGAGC' (Illumina TruSeq, Sanger iPCR; user defined)
Maximum trimming error rate: 0.1 (default)
Minimum required adapter overlap (stringency): 1 bp
Minimum required sequence length for both reads before a sequence pair gets removed: 20 bp
Output file will be GZIP compressed


This is cutadapt 3.7 with Python 3.9.6
Command line parameters: -j 1 -e 0.1 -q 20 -O 1 -a AGATCGGAAGAGC EU236_R2_raw.fastq.gz
Processing reads on 1 core in single-end mode ...
Finished in 0.09 s (301 µs/read; 0.20 M reads/minute).

=== Summary ===

Total reads processed:                     300
Reads with adapters:                       118 (39.3%)
Reads written (passing filters):           300 (100.0%)

Total basepairs processed:        45,000 bp
Quality-trimmed:                   2,140 bp (4.8%)
Total written (filtered):         42,715 bp (94.9%)

With --nextera instead of --illumina

Input filename: EU236_R2_raw.fastq.gz
Trimming mode: paired-end
Trim Galore version: 0.6.7
Cutadapt version: 3.7
Number of cores used for trimming: 1
Quality Phred score cutoff: 20
Quality encoding type selected: ASCII+33
Adapter sequence: 'CTGTCTCTTATA' (Nextera Transposase sequence; user defined)
Maximum trimming error rate: 0.1 (default)
Minimum required adapter overlap (stringency): 1 bp
Minimum required sequence length for both reads before a sequence pair gets removed: 20 bp
Output file will be GZIP compressed


This is cutadapt 3.7 with Python 3.9.6
Command line parameters: -j 1 -e 0.1 -q 20 -O 1 -a CTGTCTCTTATA EU236_R2_raw.fastq.gz
Processing reads on 1 core in single-end mode ...
Finished in 0.09 s (297 µs/read; 0.20 M reads/minute).

=== Summary ===

Total reads processed:                     300
Reads with adapters:                       233 (77.7%)
Reads written (passing filters):           300 (100.0%)

Total basepairs processed:        45,000 bp
Quality-trimmed:                   2,140 bp (4.8%)
Total written (filtered):         33,046 bp (73.4%)

With neither --nextera nor --illumina (i.e. autodetect mode)

Input filename: EU236_R2_raw.fastq.gz
Trimming mode: paired-end
Trim Galore version: 0.6.7
Cutadapt version: 3.7
Number of cores used for trimming: 1
Quality Phred score cutoff: 20
Quality encoding type selected: ASCII+33
Using Nextera adapter for trimming (count: 113). Second best hit was smallRNA (count: 16)
Adapter sequence: 'CTGTCTCTTATA' (Nextera Transposase sequence; auto-detected)
Maximum trimming error rate: 0.1 (default)
Minimum required adapter overlap (stringency): 1 bp
Minimum required sequence length for both reads before a sequence pair gets removed: 20 bp
Minimum required sequence length for both reads before a sequence pair gets removed: 20 bp
Output file will be GZIP compressed


This is cutadapt 3.7 with Python 3.9.6
Command line parameters: -j 1 -e 0.1 -q 20 -O 1 -a CTGTCTCTTATA EU236_R2_raw.fastq.gz
Processing reads on 1 core in single-end mode ...
Finished in 0.09 s (310 µs/read; 0.19 M reads/minute).

=== Summary ===

Total reads processed:                     300
Reads with adapters:                       233 (77.7%)
Reads written (passing filters):           300 (100.0%)

Total basepairs processed:        45,000 bp
Quality-trimmed:                   2,140 bp (4.8%)
Total written (filtered):         33,046 bp (73.4%)
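
The comparison across the three reports above can also be scripted. The following is a minimal sketch, assuming the report lines keep the layout shown in these logs; `parse_report` and `reports_agree` are hypothetical helpers, not part of the workflow, and the excerpts are copied from the reports above.

```python
import re


def parse_report(text):
    """Pull the adapter line and two summary numbers out of one trimming report."""
    adapter = re.search(r"Adapter sequence: '([ACGTN]+)' \((.*?); (.*?)\)", text)
    with_adapters = re.search(r"Reads with adapters:\s+([\d,]+)", text)
    written = re.search(r"Total written \(filtered\):\s+([\d,]+) bp", text)
    return {
        "adapter": adapter.group(1),
        "source": adapter.group(3),  # "user defined" or "auto-detected"
        "reads_with_adapters": int(with_adapters.group(1).replace(",", "")),
        "bp_written": int(written.group(1).replace(",", "")),
    }


def reports_agree(a, b):
    """True if two reports used the same adapter and trimmed identically.

    The 'source' field is ignored, since it is expected to differ between
    a user-supplied flag and auto-detection.
    """
    keys = ("adapter", "reads_with_adapters", "bp_written")
    return all(a[k] == b[k] for k in keys)


illumina = parse_report("""\
Adapter sequence: 'AGATCGGAAGAGC' (Illumina TruSeq, Sanger iPCR; user defined)
Reads with adapters:                       118 (39.3%)
Total written (filtered):         42,715 bp (94.9%)
""")
nextera = parse_report("""\
Adapter sequence: 'CTGTCTCTTATA' (Nextera Transposase sequence; user defined)
Reads with adapters:                       233 (77.7%)
Total written (filtered):         33,046 bp (73.4%)
""")
auto = parse_report("""\
Adapter sequence: 'CTGTCTCTTATA' (Nextera Transposase sequence; auto-detected)
Reads with adapters:                       233 (77.7%)
Total written (filtered):         33,046 bp (73.4%)
""")

print("auto vs --nextera agree:", reports_agree(auto, nextera))
print("auto vs --illumina agree:", reports_agree(auto, illumina))
```

For this sample, the auto-detected run matches the --nextera run on adapter sequence, reads with adapters, and bases written, and differs from the --illumina run.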

J-81 · 2023-05-25T21:54:51Z

Implemented in 3b7e0ba
DPPD Updated in 2a56552

J-81 mentioned this issue May 25, 2023

NF_RCP-F_1.0.4-RC1 #29

Draft

3 tasks
