script: extract the maximum sequence extent around a set of hashes #2413

ctb · 2022-12-17T19:16:03Z

extract-max-extent-around-hashes.py is a potentially useful script that does the following:

Given a query sig and a target set of contigs, extract the maximum extent
of actual DNA sequence from the contigs that would generate the hashes in
the query sig. For example, for a single hash, this would be all DNA
sequence between the neighboring hashes, but NOT including the k-mers that
generated those neighboring hashes. If multiple hashes in a row matched, it
would include all sequence up to the surrounding (non-matching) k-mers.
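
A minimal sketch of the extent logic described above, not the code in extract-max-extent-around-hashes.py. It assumes the contig has already been hashed into a position-sorted list of `(position, hashval)` pairs for every k-mer kept in the sketch. The names `max_extents`, `_bounds`, and `sketched` are hypothetical. It also assumes one reading of "NOT including the k-mers": the extent stops one base short of fully containing each neighboring non-matching k-mer.

```python
def _bounds(seq, k, sketched, first, last):
    """Half-open (start, end) around the run sketched[first..last]."""
    # Left edge: one base past the start of the previous sketched k-mer,
    # so that k-mer is not fully contained; contig start if there is none.
    start = sketched[first - 1][0] + 1 if first > 0 else 0
    # Right edge: one base short of the end of the next sketched k-mer;
    # contig end if there is none.
    if last + 1 < len(sketched):
        end = sketched[last + 1][0] + k - 1
    else:
        end = len(seq)
    return start, end


def max_extents(seq, k, sketched, query_hashes):
    """Return half-open intervals of seq around runs of matching hashes.

    sketched: (position, hashval) for every sketched k-mer, sorted by position
              (assumed input; how it is produced is outside this sketch).
    query_hashes: set of hash values from the query sig.
    """
    extents = []
    run_start = None  # index in `sketched` where the current matching run began
    for i, (_pos, hashval) in enumerate(sketched):
        if hashval in query_hashes:
            if run_start is None:
                run_start = i
        elif run_start is not None:
            # A non-matching sketched k-mer closes the current run.
            extents.append(_bounds(seq, k, sketched, run_start, i - 1))
            run_start = None
    if run_start is not None:
        # The run extends to the last sketched k-mer in the contig.
        extents.append(_bounds(seq, k, sketched, run_start, len(sketched) - 1))
    return extents


# Toy data: four sketched 4-mers on a 30 bp contig; two adjacent ones match.
contig = "A" * 30
sketched = [(2, 11), (10, 22), (15, 33), (24, 44)]
print(max_extents(contig, 4, sketched, {22, 33}))  # [(3, 27)]
```

Each interval can be sliced out with `seq[start:end]`. Under the stricter reading, where the extent should not overlap the neighboring k-mers at all, the left edge would become `position + k` and the right edge `position`.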

The motivation is to extract the maximum potentially alignable sequence
when two genomes share some hashes, but it might be useful for other
purposes, too.

ccbaumler · 2022-12-29T21:39:14Z

Been thinking. This could also be informative in identifying QTLs at the metapangenome level (or narrower) when comparing phenotypes.

ctb · 2023-01-02T17:38:07Z

> Been thinking. This could also be informative in identifying QTLs at the metapangenome level (or narrower) when comparing phenotypes.

hot take response: that's already what the hashes themselves (hopefully) do. This is more about what you do after that!

ccbaumler · 2023-08-04T20:03:49Z

How can I skip the contig hashing and use the file written by prefetch's --save-matching-hashes option as the "contig hashes" input?

Two example files (plant.prefetch.thresh0.matching.hashes.txt and 1-A03_S17_L001_001.51.sig.gz) may be found here

Or is there another way of identifying the genomic region of matching hashes?

ctb · 2023-08-05T11:18:40Z

hot take: you need the contig sequence itself so that it can be output with -o, right? and hashing the contigs should not be a big computational deal...?

ccbaumler · 2023-08-05T16:13:23Z

Sure. My question, though, is whether I can return all the unknown contiguous sequences from all the hashes I have from prefetch.

I think I'll need to write a different script for that though.

ctb · 2024-08-29T14:30:23Z

moving script into plugin: [sourmash_plugin_hashannot]

command extract_surrounding, being added in sourmash-bio/sourmash_plugin_hashannot#1

ctb added the code herein lies code label Dec 17, 2022

ctb mentioned this issue Aug 5, 2023 in "sourmash plugins - ideas dumping ground" #2453 (Open)
