Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Detect command should determine whether adapters are 5' or 3' #26

Open
jdidion opened this issue Jun 3, 2017 · 9 comments
Open

Detect command should determine whether adapters are 5' or 3' #26

jdidion opened this issue Jun 3, 2017 · 9 comments
Milestone

Comments

@jdidion
Copy link
Owner

jdidion commented Jun 3, 2017

The detect command is currently optimized for 3' adapter detection. It should be possible to detect which end the adapters are on by looking for where long runs of As occur.

@jdidion jdidion added this to the 2.0 milestone Oct 10, 2017
@wckdouglas
Copy link

Hi, I would be happy to contribute to this. Am I correct that "long runs of As" are as defined by past_end_bases under Detector class?

@jdidion
Copy link
Owner Author

jdidion commented Feb 5, 2018

Thanks @wckdouglas! Yes that's correct - at least for Illumina data, when the sequencer reads through the template and the opposite adapter completely, the following bases tend to be a run of 'A's. Currently I use a threshold of at least 8 bp to detect complete adapter read-through. For the detect command, you could make the pattern that is searched for configurable via a command line option.

@wckdouglas
Copy link

@jdidion ah thanks, I see! I am seeing the current implementation uses the option -past-end-bases that accepts regex, so it seems to take care of it. Or should it be detecting using the Aligner().locate() method if the pattern is customized?

@jdidion
Copy link
Owner Author

jdidion commented Feb 9, 2018

The code currently does the correct thing for 3' end adapters, but it assumes the adapters will always be at the 3' end. The idea is to search for runs of A's and determine whether they tend to occur towards the beginning or the end of the read. This would need to use some statistical test, as runs of A's will be distributed at various positions, but there should be a 'peak' toward the left or the right side for 5' or 3' adapters, respectively. Does that make sense?

@wckdouglas
Copy link

Thanks for the explanation, I think I am starting to get it. So am I right that we are assuming (1) sometimes adapter are not on the 3' end of the reads and bases after the adapter maybe biological meaningful. this seems impossible based on Fig 1A from the atropos peerj paper. (2) most of the time long runs of A's occur after adapter, these are the empty cycles so they have to be removed like atropos currently does before adapter detections, and (3) long runs of A's maybe biological sequences and in these cases, they should not be removed.

Anyways, would it make sense if I start implementing a counter to record the positions of runs of A's? Also is there a particular test fastq that would be good for testing this function?

@jdidion
Copy link
Owner Author

jdidion commented Mar 16, 2018

Sorry for taking so long to get back to you.

There are some library preparations in which adapters are added to the 5' ends of reads (see the section "5' adapters" in the documentation). These are less common, but we still want to handle them in Atropos.

In the Atropos paper, the focus is on the improved paired-end read trimming, as this is the major improvement from Cutadapt. Currently, only 3' end adapters are supported by the insert aligner. However, 5' adapters can still be handled (using -g/-G or -b/-B).

You are correct in points #2 and #3. However, I was partially wrong in what I said earlier - empty cycles won't occur at the 5' end of the read. The only reason for a 5' adapter not matching at the start of the 5' end is if the sequence is degraded. Thus, long runs of A's at the 5' end are not a signal of 5' adapters. Instead, I think the best that can be done is to build up a distribution of the positions of long runs of A's in a subsample of reads, and if it is shifted far enough to the 3' end, you can be pretty confident that you have 3' adapters. Otherwise you tell the user that the adapter type cannot be determined.

For test data, you can look at the datasets I use in the paper. If you look at paper/containers/data/rna/build.sh and /paper/containers/data/wgbs/build.sh you can see how to download these data from SRA. However, these data both use 3' adapters. I am not sure where to get data that uses 5' adapters - you might have to simulate it.

@jdidion
Copy link
Owner Author

jdidion commented Mar 16, 2018

@wckdouglas if you'd like to work on this issue (i.e. if it will help you in your own work) please, by all means. However, I think issue #65 is higher priority and would be more straight-forward to implement.

@wckdouglas
Copy link

wckdouglas commented Mar 18, 2018

Agree, this issue (#26) will require some careful thoughts and design. I will see what I can do with #65, which will likely be a very helpful feature for my own workflow as well.

@jdidion
Copy link
Owner Author

jdidion commented Mar 19, 2018

Excellent, thanks.

@jdidion jdidion modified the milestones: 2.0, 2.2 Dec 25, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants