Is linear synteny assumed? #233
Hi Bob, could you share your example?
wfmash first runs a similarity search (with MashMap3) to find similar regions between the input sequences, and then applies end-to-end alignment (hierarchical wavefront alignment) only to those regions.
On Saturday, April 6, 2024, at 10:26:36 AM, Bob Harris wrote:
I am new to wfmash (but of course have plenty of alignment experience).
I tried an experiment in which I generated a random genome and a second with rearrangements (including random reversals) of items from the first with various lengths (from 1K to 10K) and divergences between 0 and 20%. No indels (other than the rearrangements). There are no duplications, and every bit of each genome is homologous to a unique bit of the other.
I tried aligning using wfmash apple.fa orange.fa --map-pct-id=70 and no alignments at all were reported. Total genome length was ≈400K.
I should note that I built wfmash from a clone of just a couple days ago (v0.13.0-3-gc18520b), and that I have run a few other experiments mapping similarly diverged items, but as separate reads, to the random genome, and I do get alignments in that case (though not as many as I might expect).
I notice in issue #161 an Oct/24/2022 post that shows a dot plot of a whole genome alignment that looks mostly syntenic. Which makes me wonder if wfmash assumes end-to-end synteny.
More generally I'm trying to understand how I need to parameterize wfmash to find all the homologies in my sequences, and to understand what its limitations are w.r.t. divergence and lengths.
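For readers who want to reproduce an experiment of this kind, here is a minimal sketch of how such a genome pair could be generated. It is not the script used for apple.fa and orange.fa; the function names, the 50% reversal rate, and the per-segment uniform divergence are assumptions made for illustration. It also does not enforce that no two segments keep the same neighbors after shuffling.

```python
import random

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def random_dna(length, rng):
    """Return a uniformly random DNA string of the given length."""
    return "".join(rng.choice("ACGT") for _ in range(length))


def mutate(seq, divergence, rng):
    """Substitute each base independently with probability `divergence` (no indels)."""
    out = []
    for base in seq:
        if rng.random() < divergence:
            out.append(rng.choice([b for b in "ACGT" if b != base]))
        else:
            out.append(base)
    return "".join(out)


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def make_pair(total_len=400_000, min_len=1_000, max_len=10_000,
              max_div=0.20, seed=0):
    """Build a random genome and a shuffled, mutated, partly reversed copy.

    Every base of the first genome ends up in exactly one segment of the
    second, so there are no duplications. The final segment may be shorter
    than `min_len` because it takes whatever is left.
    """
    rng = random.Random(seed)
    genome = random_dna(total_len, rng)

    # Cut the genome into consecutive segments of random length.
    segments = []
    pos = 0
    while pos < total_len:
        seg_len = min(rng.randint(min_len, max_len), total_len - pos)
        segments.append(genome[pos:pos + seg_len])
        pos += seg_len

    # Rearrange, then diverge and (sometimes) reverse each segment.
    rng.shuffle(segments)
    pieces = []
    for seg in segments:
        seg = mutate(seg, rng.uniform(0.0, max_div), rng)
        if rng.random() < 0.5:  # assumed reversal rate
            seg = reverse_complement(seg)
        pieces.append(seg)
    return genome, "".join(pieces)
```

Writing the two returned strings out as single-record FASTA files gives inputs comparable in shape to the ones described above.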
@AndreaGuarracino I've placed the example here: https://drive.google.com/drive/folders/1z69e_uTaYdMjEdP9m5SGf2CPaEm3KiO8?usp=share_link
apple.fa and orange.fa are the fake genomes. barcodes_locations.dat gives hints about the embedded homologies. For example,
Hi @rsharris, It looks like the cause of most of the missing mappings is that the colinear chains identified by MashMap are too short to pass the "block-length filter."
So in your example, all colinear chains w/ block-length < 25k are being filtered out. Keep in mind that even if you turn the block-length filter off with You can set the parameters of
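To make the effect of the block-length filter concrete, here is a small sketch of the idea. It is not wfmash's or MashMap's code: the `Chain` type, the function name, and the choice to measure block length as the query span are assumptions for illustration; only the 25k threshold comes from the comment above.

```python
from typing import NamedTuple


class Chain(NamedTuple):
    """A colinear chain of mappings, reduced to its span (hypothetical type)."""
    query_start: int
    query_end: int  # exclusive

    @property
    def block_length(self):
        # Assumption: block length is taken as the span on the query.
        return self.query_end - self.query_start


def block_length_filter(chains, min_block_length=25_000):
    """Keep only chains whose span reaches the minimum block length."""
    return [c for c in chains if c.block_length >= min_block_length]


# Planted homologies of 1K to 10K that never chain with a neighbor
# all fall below a 25k threshold, so nothing survives the filter.
planted = [Chain(0, 1_000), Chain(5_000, 15_000), Chain(40_000, 47_500)]
surviving = block_length_filter(planted)
```

With rearranged segments this short, each homology forms its own chain, which is why a length threshold well above 10K would leave no mappings to align.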
Thanks, @bkille, that's helpful, and it's the kind of insight I'm trying to gather, from a user perspective. I'm running some additional experiments with randoms and expect to have some more observations later this afternoon.
Based on my (admittedly very limited) experiments, I think the statement "wfmash is designed ... whole genome alignment ... can handle ... average nucleotide identity ... as low as 70%" is misleading. As currently written (in the README), a naive interpretation is that wfmash will discover all homologies with ANI≥70%. But in reality --map-pct-id=70 just sets the ANI threshold for the mashmap step, and my simple experiments appear to show that the discovery threshold for that step is somewhat higher than 70%.
I ran the same experiment as earlier, except I allowed identity to range from 70% to 100%. I ran
Probably if the user is familiar with mashmap, they would understand these limitations (or how to tweak the params to avoid them). But this doesn't seem very clear from the current README. I get the sense, now, that my expectations were higher than were warranted, having interpreted that paragraph as "fast whole genome alignment with identity as low as 70%" and ignoring the rest of that section.
For what it's worth, the new experiment's data is in the same place as before: https://drive.google.com/drive/folders/1z69e_uTaYdMjEdP9m5SGf2CPaEm3KiO8?usp=share_link
Thanks for sharing the data. That's a fair point that the README right now could use some more detail/instruction. Arguably, the bigger issue here is that I haven't set up the k-mer size to be automatically adjusted based on the minimum-identity threshold. Right now, only the sketch-size is adjusted automatically.
As far as the test dataset goes, I would say that this one is particularly challenging even for 70% ANI due to the lack of synteny 😅 FWIW, by dropping the k-mer size to 13 it looks like most of the homologies are recovered (and with much more accurate ANI predictions).
Also, this helped me notice that we should likely be adjusting the "chaining gap" to be dynamic w/ the size of the segments, so thank you for that as well!
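A rough way to see why a smaller k-mer size helps at low identity: under a simple model of independent substitutions (an assumption, and not how MashMap itself estimates ANI), a k-mer survives unchanged only if all k of its bases do, so the expected fraction of intact k-mers is identity raised to the power k. The sketch below compares k = 13 with a larger, purely illustrative k = 19; the function names are hypothetical.

```python
def intact_kmer_fraction(identity, k):
    """Probability that a k-mer carries no substitution.

    Assumes each base is conserved independently with probability `identity`.
    """
    return identity ** k


def expected_intact_kmers(segment_length, identity, k):
    """Expected number of unmutated k-mers in one homologous segment."""
    n_kmers = max(0, segment_length - k + 1)
    return n_kmers * intact_kmer_fraction(identity, k)


# Compare two k-mer sizes for a 1K segment across the identity range
# used in the experiments (70% to 100%).
for identity in (0.70, 0.80, 0.90, 1.00):
    small_k = expected_intact_kmers(1_000, identity, 13)
    large_k = expected_intact_kmers(1_000, identity, 19)
    print(f"identity={identity:.2f}  k=13: {small_k:8.1f}  k=19: {large_k:8.1f}")
```

The shorter the segment and the lower the identity, the fewer exact k-mer matches remain to seed a mapping, which is consistent with short, diverged segments being the hardest to recover.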
Regarding lack of synteny: it certainly is a contrived dataset. I intentionally made sure none of the 'planted' homologies had the same neighbors in both sequences. My interest was in seeing how well each homology would be discovered on its own. Would real data ever look like that? Probably not. But I do recall Drosophila having a high degree of rearrangement, though perhaps not with such short pieces.
Probably should change the title of this issue to something more meaningful.
Contrived or not, it helped bring to light/remind me of some issues that I need to address with MashMap integration before the "1.0.0" release, so thank you for that.