BulkRNASeq workflow should determine adaptor type automatically #20

Open
Tracked by #29
J-81 opened this issue Apr 7, 2023 · 4 comments

Comments

@J-81
Contributor

J-81 commented Apr 7, 2023

Currently, the workflow user is expected to replace this value manually in the workflow module file.
Instead, the adaptor should be determined automatically, perhaps from the raw FastQC/MultiQC reports, and supplied to the trimming process.

DPPD Reference

--illumina \ # if adapters are not illumina, replace with adapters used

Workflow Reference

trim_galore --gzip \
--cores $task.cpus \
--illumina \
--phred33 \
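
A minimal sketch of the FastQC-based route suggested above, assuming the raw `fastqc_data.txt` contains an "Adapter Content" module with one tab-separated column per adapter type. The column names, the 1% threshold, and the mapping to trim_galore flags are illustrative assumptions, not part of the workflow.

```python
from typing import Dict, Optional

# Hypothetical mapping from FastQC adapter column names to trim_galore flags.
# The column names are assumptions about the FastQC report layout.
FASTQC_TO_FLAG = {
    "Illumina Universal Adapter": "--illumina",
    "Nextera Transposase Sequence": "--nextera",
    "Illumina Small RNA 3' Adapter": "--small_rna",
}


def max_adapter_content(fastqc_data: str) -> Dict[str, float]:
    """Return the peak percentage per adapter column of the Adapter Content module."""
    peaks: Dict[str, float] = {}
    header = None
    in_module = False
    for line in fastqc_data.splitlines():
        if line.startswith(">>Adapter Content"):
            in_module = True
            continue
        if not in_module:
            continue
        if line.startswith(">>END_MODULE"):
            break
        fields = line.split("\t")
        if line.startswith("#"):
            # Header row: first field is the position, the rest are adapter names.
            header = fields[1:]
            peaks = {name: 0.0 for name in header}
        elif header:
            for name, value in zip(header, fields[1:]):
                peaks[name] = max(peaks[name], float(value))
    return peaks


def pick_flag(peaks: Dict[str, float], min_percent: float = 1.0) -> Optional[str]:
    """Pick the flag for the most abundant known adapter, or None if none reaches the threshold."""
    candidates = {
        name: pct
        for name, pct in peaks.items()
        if name in FASTQC_TO_FLAG and pct >= min_percent
    }
    if not candidates:
        return None
    best = max(candidates, key=candidates.get)
    return FASTQC_TO_FLAG[best]
```

Returning `None` when no adapter reaches the threshold leaves room for the workflow to fall back to a default or to trim_galore's own detection.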

@J-81
Contributor Author

J-81 commented Apr 7, 2023

Potential route: use trim_galore's built-in adaptor auto-detection:
https://github.com/FelixKrueger/TrimGalore/blob/0.6.7/Docs/Trim_Galore_User_Guide.md#adapter-auto-detection
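
To illustrate the idea behind auto-detection, here is a rough sketch that counts how many reads contain each candidate adapter prefix and picks the most frequent one. It is not Trim Galore's actual implementation. The Illumina and Nextera prefixes are the ones reported in the logs in this thread; the small RNA prefix does not appear in this thread and is assumed from the linked user guide, as is the 1,000,000-read limit.

```python
import gzip
from itertools import islice
from typing import Dict, Iterable, Iterator, Tuple

# Candidate adapter prefixes. The smallRNA entry is an assumption taken from
# the linked user guide; the other two appear in the trimming reports.
ADAPTERS = {
    "Illumina": "AGATCGGAAGAGC",
    "Nextera": "CTGTCTCTTATA",
    "smallRNA": "TGGAATTCTCGG",
}


def read_sequences(fastq_gz_path: str, max_reads: int = 1_000_000) -> Iterator[str]:
    """Yield up to max_reads sequence lines from a gzipped FASTQ file."""
    with gzip.open(fastq_gz_path, "rt") as handle:
        # In a 4-line FASTQ record the sequence is the second line.
        for line in islice(handle, 1, max_reads * 4, 4):
            yield line.strip()


def count_adapter_hits(
    sequences: Iterable[str], adapters: Dict[str, str] = ADAPTERS
) -> Dict[str, int]:
    """Count the reads that contain each adapter prefix as an exact substring."""
    counts = {name: 0 for name in adapters}
    for seq in sequences:
        for name, adapter in adapters.items():
            if adapter in seq:
                counts[name] += 1
    return counts


def best_adapter(counts: Dict[str, int]) -> Tuple[str, int]:
    """Return the (name, count) pair with the highest count."""
    return max(counts.items(), key=lambda item: item[1])
```

Calling `best_adapter(count_adapter_hits(read_sequences(path)))` on a raw read file would give a quick independent check of whatever adapter the trimming report names.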

@J-81
Contributor Author

J-81 commented Apr 12, 2023

I'll try auto-detection by omitting the flag, and will of course validate that the auto-detected adaptor is consistent with supplying the parameter directly.

@J-81
Contributor Author

J-81 commented May 8, 2023

Testing Results using GLDS-426_Truncated (Known to have Nextera adapters):

CURRENT (With --illumina)

Input filename: EU236_R2_raw.fastq.gz
Trimming mode: paired-end
Trim Galore version: 0.6.7
Cutadapt version: 3.7
Number of cores used for trimming: 1
Quality Phred score cutoff: 20
Quality encoding type selected: ASCII+33
Adapter sequence: 'AGATCGGAAGAGC' (Illumina TruSeq, Sanger iPCR; user defined)
Maximum trimming error rate: 0.1 (default)
Minimum required adapter overlap (stringency): 1 bp
Minimum required sequence length for both reads before a sequence pair gets removed: 20 bp
Output file will be GZIP compressed


This is cutadapt 3.7 with Python 3.9.6
Command line parameters: -j 1 -e 0.1 -q 20 -O 1 -a AGATCGGAAGAGC EU236_R2_raw.fastq.gz
Processing reads on 1 core in single-end mode ...
Finished in 0.09 s (301 µs/read; 0.20 M reads/minute).

=== Summary ===

Total reads processed:                     300
Reads with adapters:                       118 (39.3%)
Reads written (passing filters):           300 (100.0%)

Total basepairs processed:        45,000 bp
Quality-trimmed:                   2,140 bp (4.8%)
Total written (filtered):         42,715 bp (94.9%)

With --nextera instead of --illumina

Input filename: EU236_R2_raw.fastq.gz
Trimming mode: paired-end
Trim Galore version: 0.6.7
Cutadapt version: 3.7
Number of cores used for trimming: 1
Quality Phred score cutoff: 20
Quality encoding type selected: ASCII+33
Adapter sequence: 'CTGTCTCTTATA' (Nextera Transposase sequence; user defined)
Maximum trimming error rate: 0.1 (default)
Minimum required adapter overlap (stringency): 1 bp
Minimum required sequence length for both reads before a sequence pair gets removed: 20 bp
Output file will be GZIP compressed


This is cutadapt 3.7 with Python 3.9.6
Command line parameters: -j 1 -e 0.1 -q 20 -O 1 -a CTGTCTCTTATA EU236_R2_raw.fastq.gz
Processing reads on 1 core in single-end mode ...
Finished in 0.09 s (297 µs/read; 0.20 M reads/minute).

=== Summary ===

Total reads processed:                     300
Reads with adapters:                       233 (77.7%)
Reads written (passing filters):           300 (100.0%)

Total basepairs processed:        45,000 bp
Quality-trimmed:                   2,140 bp (4.8%)
Total written (filtered):         33,046 bp (73.4%)

With neither --nextera nor --illumina (i.e. autodetect mode)

Input filename: EU236_R2_raw.fastq.gz
Trimming mode: paired-end
Trim Galore version: 0.6.7
Cutadapt version: 3.7
Number of cores used for trimming: 1
Quality Phred score cutoff: 20
Quality encoding type selected: ASCII+33
Using Nextera adapter for trimming (count: 113). Second best hit was smallRNA (count: 16)
Adapter sequence: 'CTGTCTCTTATA' (Nextera Transposase sequence; auto-detected)
Maximum trimming error rate: 0.1 (default)
Minimum required adapter overlap (stringency): 1 bp
Minimum required sequence length for both reads before a sequence pair gets removed: 20 bp
Minimum required sequence length for both reads before a sequence pair gets removed: 20 bp
Output file will be GZIP compressed


This is cutadapt 3.7 with Python 3.9.6
Command line parameters: -j 1 -e 0.1 -q 20 -O 1 -a CTGTCTCTTATA EU236_R2_raw.fastq.gz
Processing reads on 1 core in single-end mode ...
Finished in 0.09 s (310 µs/read; 0.19 M reads/minute).

=== Summary ===

Total reads processed:                     300
Reads with adapters:                       233 (77.7%)
Reads written (passing filters):           300 (100.0%)

Total basepairs processed:        45,000 bp
Quality-trimmed:                   2,140 bp (4.8%)
Total written (filtered):         33,046 bp (73.4%)
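
The comparison above can also be checked programmatically. The sketch below parses the adapter line and read counts out of a trimming report and confirms that the `--nextera` run and the auto-detect run agree. The helper name and the regular expressions are illustrative and rely only on the report lines shown in this thread.

```python
import re
from typing import Dict, Union


def parse_trimming_report(text: str) -> Dict[str, Union[str, int, float]]:
    """Extract the adapter, how it was chosen, and read counts from a report."""
    adapter = re.search(r"Adapter sequence: '([ACGTN]+)' \(([^;]*); ([^)]*)\)", text)
    total = re.search(r"Total reads processed:\s+([\d,]+)", text)
    hits = re.search(r"Reads with adapters:\s+([\d,]+)", text)
    if not (adapter and total and hits):
        raise ValueError("report is missing an expected line")
    total_reads = int(total.group(1).replace(",", ""))
    reads_with_adapters = int(hits.group(1).replace(",", ""))
    return {
        "adapter": adapter.group(1),
        "source": adapter.group(3),  # e.g. "user defined" or "auto-detected"
        "total_reads": total_reads,
        "reads_with_adapters": reads_with_adapters,
        "percent_with_adapters": round(100 * reads_with_adapters / total_reads, 1),
    }


# Excerpts copied from the reports in this comment.
nextera_report = """\
Adapter sequence: 'CTGTCTCTTATA' (Nextera Transposase sequence; user defined)
Total reads processed:                     300
Reads with adapters:                       233 (77.7%)
"""
autodetect_report = """\
Adapter sequence: 'CTGTCTCTTATA' (Nextera Transposase sequence; auto-detected)
Total reads processed:                     300
Reads with adapters:                       233 (77.7%)
"""

user = parse_trimming_report(nextera_report)
auto = parse_trimming_report(autodetect_report)

# The two runs should differ only in how the adapter was chosen.
assert user["adapter"] == auto["adapter"]
assert user["reads_with_adapters"] == auto["reads_with_adapters"]
```

In the three runs above, the `--nextera` and auto-detect reports agree on the adapter and on 233 of 300 reads (77.7%), while the `--illumina` run finds adapters in only 118 of 300 reads (39.3%).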

@J-81
Contributor Author

J-81 commented May 25, 2023

@J-81 J-81 mentioned this issue May 25, 2023
3 tasks