script: extract the maximum sequence extent around a set of hashes #2413

Open
ctb opened this issue Dec 17, 2022 · 6 comments
Labels
code herein lies code

Comments

@ctb
Contributor

ctb commented Dec 17, 2022

extract-max-extent-around-hashes.py is a potentially useful script that does the following:

Given a query sig, and a target set of contigs,

extract the maximum extent of actual DNA sequence from the contigs that
would generate the hashes in the query sig. For example, for a single hash,
this would be all DNA sequence between the neighboring hashes, but NOT
including the k-mers that generated those neighboring hashes. If multiple
hashes in a row matched, it would include all sequence up to the surrounding
(non-matching) k-mers.

The motivation is to extract the maximum potentially alignable sequence
when two genomes share some hashes, but it might be useful for other
purposes, too.
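
A minimal sketch of the idea described above, for illustration only; it is not the script itself. Everything here is an assumption: `toy_hash`, `sketch_positions`, `max_extents`, and the `keep` predicate are hypothetical names, the hash is a stand-in (sourmash hashes canonical k-mers with MurmurHash, and this sketch ignores reverse complements), and "NOT including the neighboring k-mers" is interpreted as "the extent stops one base short of containing a neighboring k-mer in full".

```python
import hashlib


def toy_hash(kmer):
    """Stand-in 64-bit hash (assumption: NOT the hash sourmash uses)."""
    return int.from_bytes(hashlib.sha256(kmer.encode()).digest()[:8], "big")


def sketch_positions(seq, ksize, keep, hash_fn=toy_hash):
    """Return [(start, hash)] for every k-mer whose hash is kept in the sketch."""
    hits = []
    for start in range(len(seq) - ksize + 1):
        h = hash_fn(seq[start:start + ksize])
        if keep(h):
            hits.append((start, h))
    return hits


def max_extents(seq, ksize, query_hashes, keep, hash_fn=toy_hash):
    """Return half-open (start, end) intervals around runs of matching hashes."""
    hits = sketch_positions(seq, ksize, keep, hash_fn)
    extents = []
    i = 0
    while i < len(hits):
        if hits[i][1] not in query_hashes:
            i += 1
            continue
        # Extend the run across consecutive sketch k-mers that also match.
        j = i
        while j + 1 < len(hits) and hits[j + 1][1] in query_hashes:
            j += 1
        # Left edge: one base past the start of the previous non-matching
        # sketch k-mer, so that k-mer is not fully contained.
        start = hits[i - 1][0] + 1 if i > 0 else 0
        # Right edge: one base short of the end of the next non-matching
        # sketch k-mer; with no neighbor, run to the end of the contig.
        end = hits[j + 1][0] + ksize - 1 if j + 1 < len(hits) else len(seq)
        extents.append((start, end))
        i = j + 1
    return extents


# Deterministic demo with a hypothetical lookup-table hash: only the four
# homopolymer 3-mers are "in the sketch".
table = {"AAA": 1, "CCC": 2, "GGG": 3, "TTT": 4}
contig = "AAACCCGGGTTT"
extents = max_extents(contig, 3, {2}, keep=lambda h: h < 10,
                      hash_fn=lambda kmer: table.get(kmer, 99))
for start, end in extents:
    print(start, end, contig[start:end])  # prints: 1 8 AACCCGG
```

In the demo, the query contains only the hash for `CCC`; the extracted sequence runs from just after the start of `AAA` to just before the end of `GGG`, so neither neighboring k-mer is fully included.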

ctb added the code label ("herein lies code") on Dec 17, 2022
@ccbaumler
Contributor

Been thinking. This could also be informative in identifying QTLs at the metapangenome level (or narrower) when comparing phenotypes.

@ctb
Contributor Author

ctb commented Jan 2, 2023

> Been thinking. This could also be informative in identifying QTLs at the metapangenome level (or narrower) when comparing phenotypes.

hot take response: that's already what the hashes themselves (hopefully) do. This is more about what you do after that!

@ccbaumler
Contributor

ccbaumler commented Aug 4, 2023

How can I skip the contig hashing and use the file saved by prefetch --save-matching-hashes as the "contig hashes" input?

Two example files (plant.prefetch.thresh0.matching.hashes.txt and 1-A03_S17_L001_001.51.sig.gz) may be found here

Or is there another way of identifying the genomic region of matching hashes?

@ctb
Contributor Author

ctb commented Aug 5, 2023

hot take: you need the contig sequence itself so that it can be output with -o, right? and hashing the contigs should not be a big computational deal...?

@ccbaumler
Contributor

Sure. My question, though, is whether I can return all the unknown contiguous sequences for all the hashes I have from prefetch.

I think I'll need to write a different script for that though.

@ctb
Contributor Author

ctb commented Aug 29, 2024

moving script into plugin: [sourmash_plugin_hashannot]

command extract_surrounding, being added in sourmash-bio/sourmash_plugin_hashannot#1
