command | functionality | docs |
---|---|---|
manysketch |
Rapidly build sketches for many input files | link |
singlesketch |
Sketch a single sample | link |
fastgather |
Multithreaded gather of one metagenome against a database |
link |
fastmultigather |
Multithreaded gather of multiple metagenomes against a database |
link |
manysearch |
Multithreaded containment search for many queries in many large metagenomes | link |
multisearch |
Multithreaded comparison of multiple sketches, in memory | link |
pairwise |
Multithreaded pairwise comparison of multiple sketches, in memory | link |
cluster |
cluster sequences based on similarity data from pairwise or multisearch |
link |
index |
build a RocksDB inverted index for efficient containment queries | link |
This repository implements multithreaded plugins for
sourmash that provide very fast
implementations of sketch
, search
, and gather
. These commands
are typically hundreds to thousands of times faster, and 10-50x lower
memory, than the current sourmash code. For example, a gather
of
SRR606249 with sourmash v4.8.6 against GTDB rs214 takes 40 minutes and
14 GB of RAM, while fastgather
with 64 cores takes only 2 minutes
and 2 GB of RAM.
The main drawback to these plugin commands is that their inputs are more restricted than the native sourmash commands, and in some cases their outputs are in a different format. This means that your input files may need to be prepared differently, and the output may need to be processed differently. The plugin commands are also a bit less user friendly, because (for now) we're more focused on speed than polish and user experience.
sourmash supports a variety of different storage formats for sketches (see sourmash docs), and the branchwater plugin works with some (but not all) of them. Branchwater also supports an additional database type, a RocksDB-based inverted index, that is not (yet) supported natively by sourmash (through v4.8.11).
We recommend using zip files or standalone manifest CSVs pointing to zip files whenever you need to provide multiple sketches.
command | command input | database format |
---|---|---|
manysketch |
CSV with input fasta/fastq paths (details below) | produces Zip database |
gather |
Single metagenome in sig, zip, or pathlist | Zip or pathlist |
fastmultigather |
Multiple metagenomes in sig, zip, or pathlist | Zip, pathlist, or rocksdb index |
manysearch |
Multiple genomes in sig, zip, or pathlist | Zip, pathlist, or rocksdb index |
multisearch |
Multiple sketches in sig, zip, or pathlist | Multiple sketches in sig, zip, or pathlist |
pairwise |
Multiple sketches in sig, zip, or pathlist | N/A |
cluster |
Output from pairwise or multisearch |
N/A |
index |
Multiple sketches in sig, zip, or pathlist | N/A |
When working with large collections of small sketches such as genomes, we suggest using zipfiles as produced by sourmash (e.g. using sourmash sig cat
or manysketch
). Zip files have a few nice features:
- sketches are compressed in zip files;
- zip files can contain many sketches, including incompatible types (e.g. multiple k-mer sizes);
- subsets of zip files can be efficiently selected and loaded;
- in particular, single sketches can be loaded on demand, supporting lower memory requirements for certain kinds of searches.
For all these reasons, zip files are the most efficient and effective basic storage type for sketches in sourmash, and the branchwater plugin fully supports them!
You can create zipfiles with sourmash like so:
sourmash sig cat <list of sketches> -o sigs.zip
The branchwater plugin commands take a relatively restricted set of inputs, compared to sourmash: they take a single file for query and a single file for search. These files must be a collection of sketches (in .sig/.sig.gz format), a RocksDB database, a pathlist, or a standalone manifest.
When you want to search subsets or deal with many large files, we recommend standalone manifests over constructing new databases or using pathlists to point at subsets of files. Why?
There are three main reasons NOT to use zip files or pathlists in this case:
- first, metagenome sketches are often extremely large (100s of MBs to GBs), and it is not ideal to zip them all into a single zip file;
- second, both
manysearch
andfastmultigather
take a single argument that specifies collections of metagenomes which need to be loaded on demand, because they cannot fit into memory; - third, when searching subsets of large collections, it can be expensive to make a copy of the subset;
Manifests are a sourmash filetype that catalog sketch content and can
be used to point at many sketches and/or subset large collections of
sketches, and so they provide solutions to these problems. In
particular, manifests let you provide large collections of
sketches/collections of large sketches to manysearch
and
fastmultigather
, and also let you select a subset of a collection
without making a copy.
The branchwater plugin supports manifest CSVs. These can be created
from lists of sketches by using sourmash sig collect
or sourmash sig check
; for example,
sourmash sig check <database.sig.zip -F csv --picklist idents.csv:ident:ident -m subset.csv
will create a manifest CSV containing a subset of sketches using picklists,
and
find -type f /path/to/sig/files/* > pathlist.txt
sourmash sig collect pathlist.txt -o summary-manifest.csv -F csv
will collect a list of all of the sketches under /path/to/sig/files
and make the list available through a combined manifest.
Note here that manifests are much smaller than the files containing all of the sketches!
Note also that manifests have many advantages over pathlists: in particular, they contain metadata that enables fast loading of specific sketches, and they support subsetting from large databases; pathlists support neither.
The branchwater plugin also supports a database type that is not yet supported by sourmash: inverted indexes stored in a RocksDB database. These indexes provide fast and low-memory lookups when searching very large datasets, and are used for the branchwater petabase scale search hosted at branchwater.sourmash.bio.
Some commands - fastmultigather
and manysearch
- support using
these RocksDB-based inverted indexes for efficient search, and they
can be created by running sourmash scripts index
. See
the index
documentation, below.
Note: We no longer recommend using "pathlists". Use zip files or standalone manifests instead.
You can make a pathlist by listing a collection of .sig.gz files like so:
find /path/to/directory/ -name "*.sig.gz" -type f > directory.txt
When using a pathlist for search, we load all signatures into memory
at the start in order to generate a manifest. To avoid memory issues,
the signatures are not kept in memory, but instead re-loaded as
described below for each command (see: Notes on concurrency and
efficiency). This makes using pathlists less efficient than zip
files.
The manysketch
command sketches a list of FASTA/FASTQ files into a
zipped sourmash signature collection (zip
). manysketch
uses one
thread per input file, so it can (very) efficiently sketch many files
at once; and, because sequence file parsing is entirely implemented in
Rust, it is much, much faster than sourmash sketch
for large FASTQ
files. However, it does not currently support translation,
i.e. protein signature generation from DNA FASTA.
manysketch
can be used to sketch collections of genomes, pairs of R1/R2
files, and more flexible collections based on prefix - which approach is
used depends on the columns provided to manysketch
in the input CSV file.
To run manysketch
, you need to build a text file list of FASTA/FASTQ files (see manysketch.csv
example, below).
The following formats are accepted:
- 3 columns:
name,genome_filename,protein_filename
genome_filename
entries are considered DNA FASTA,protein_filename
entries are considered protein FASTA. - 3 columns:
name,read1,read2
All entries considered DNA FASTA, and both
read1
andread2
files are used as input for a single sketch with namename
. - 4 columns:
name,input_moltype,prefix,exclude
This filetype uses
glob
to find files that matchprefix
but do not matchexclude
. As such,*
are ok in theprefix
andexclude
columns. Since we are dealing with "prefixes" here, we automatically search with*
on the end of theprefix
entry.
A simple way to build a manysketch input file for a directory is this command snippet:
echo name,genome_filename,protein_filename > manysketch.csv
for i in *.fa.gz
do
echo $i,$i,
done >> manysketch.csv
You can then run:
sourmash scripts manysketch manysketch.csv -o fa.zip
The output will be written to fa.zip
You can check if all signatures were written properly with
sourmash sig summarize fa.zip
To modify sketching parameters, use --param-str
or -p
and provide valid param string(s)
sourmash scripts manysketch fa.csv -o fa.zip -p k=21,k=31,k=51,scaled=1000,abund -p protein,k=10,scaled=200
See the sourmash sketch docs for more information on param strings.
manysketch
also supports building independent sketches for each record in a FASTA file (--singleton
).
You can run:
sourmash scripts manysketch manysketch.csv -o fa.zip --singleton
The output will be written to fa.zip
You can check if all signatures were written properly with
sourmash sig summarize fa.zip
The number of sketches per parameter combination should equal the total number of records in all input FASTA.
The name
column will not be used. Instead, each sketch will be named from the FASTA record name.
manysketch
supports all sourmash moltypes: protein
, hp
, and dayhoff
. See also sourmash
protein encoding documentation and sourmash
parameter documentation for more information about what these "moltypes" mean and their default parameters.
manysketch
does not translate DNA to protein, sorry. You'll need to do that ahead of time.
If you have a proteins.csv
file which looks like:
name,genome_filename,protein_filename
protein1,,protein1.fa
protein2,,protein2.fa
You can run:
sourmash scripts manysketch proteins.csv -o proteins.zip -p dayhoff,k=16 -p protein,k=10 -p hp,k=42
The output will be written to proteins.zip
You can check if all signatures were written properly with
sourmash sig summarize proteins.zip
In this case, three sketches of protein
, dayhoff
, and hp
moltypes were made for each file in proteins.csv
and saved to proteins.zip
.
The singlesketch
command generates a sketch for a single sample, with one or more input FASTA/FASTQ files.
sourmash scripts singlesketch input.fa -p k=21,scaled=1000,dna -o output.sig --name signature_name
You can use -
for stdin
and output the result to stdout
:
cat input.fa | sourmash scripts singlesketch - -o -
The multisearch
command compares one or more query genomes, and one or more subject genomes. It differs from manysearch
because it loads everything into memory.
multisearch
takes two input collections and outputs a CSV:
sourmash scripts multisearch query.sig.gz database.zip -o results.csv
The results file results.csv
, will have 8 columns: query
and
query_md5
, match
and match_md5
, and containment
, jaccard
,
max_containment
, and intersect_hashes
.
The pairwise
command does the same comparisons as multisearch
but
takes only a single collection of sketches, for which it calculates
all the pairwise comparisons. Since the comparisons are symmetric, it
is approximately twice as fast as multisearch
.
The -t/--threshold
for multisearch
and pairwise
applies to the
containment of query-in-target and defaults to 0.01. To report any
overlap between two sketches, set the threshold to 0.
The fastgather
command is parallelized (and typically much faster)
version of sourmash gather
.
fastgather
takes a single query metagenome and a database, and outputs a CSV:
sourmash scripts fastgather query.sig.gz database.zip -o results.csv --cores 4
fastgather
outputs the same columns as sourmash gather
, with only a few exception
match_name
is output instead ofname
;match_md5
is output instead ofmd5
;match_filename
is output instead offilename
, and the value is different;potential_false_negative
is not present infastgather
output;
fastmultigather
takes a collection of query metagenomes and a collection of sketches as a database, and outputs a CSV file containing all the matches.
sourmash scripts fastmultigather queries.manifest.csv database.zip --cores 4 --save-matches -o results.csv
We suggest using standalone manifest CSVs wherever possible, especially if the queries are large.
The main advantage that fastmultigather
has over running
fastgather
on multiple queries is that fastmultigather
only needs
to load the database once for all queries, unlike with fastgather
;
this can be a significant time savings for large databases.
fastmultigather
will output a gather file containing all results in
one file, specified with -o/--output
. fastmultigather
gather CSVs
provide the same columns as fastgather
, above.
In addition, on a database of sketches (but not on RocksDB indexes)
fastmultigather
will output a prefetch
file containing all
overlapping matches between that query and the database. The prefetch
CSV will be named {signame}.prefetch.csv
, where {signame}
is the
name of your sourmash signature.
--save-matches
is an optional flag that will save the matched hashes
for each query in a separate sourmash signature
{signame}.matches.sig
. This can be useful for debugging or for
further analysis.
Warning: At the moment, if two different queries have the same
{signame}
, the output files for one query will be overwritten by
the results from the other query. The behavior here is undefined in
practice, because of multithreading: we don't know what queries will
be executed when or files will be written first.
The manysearch
command compares one or more collections of query
sketches, and one or more collections of subject sketches. It is the
core command we use for searching petabase-scale databases of
metagenomes for contained genomes.
manysearch
takes two collections as input, and outputs a CSV:
sourmash scripts manysearch queries.zip metagenomes.manifest.csv -o results.csv
We suggest using a manifest CSV for the metagenome collection.
The results file here, query.x.gtdb-reps.csv
, will have the
following columns: query
, query_md5
, match_name
, match_md5
,
containment
, jaccard
, max_containment
, intersect_hashes
,
query_containment_ani
.
If you run manysearch
without using a RocksDB database (that is,
against regular sketches), the results file will also have the
following columns: , match_containment_ani
,
average_containment_ani
, and max_containment_ani
.
Finally, if using sketches that have abundance information, the
results file will also contain the following columns: average_abund
,
median_abund
, std_abund
, n_weighted_found
, and
total_weighted_hashes
.
See the prefetch CSV output column documentation for information on these various columns.
The -t/--threshold
for manysearch
applies to the containment of
query-in-database (e.g. genome in metagenome) and defaults to 0.01.
To report any overlap between two sketches, set the threshold to 0.
(This will produce many, many results when searching a collection of
metagenomes!)
Using -A/--output-all-comparisons
will ignore the threshold parameter
and output all comparisons done. Against a RocksDB database, only matches
with some overlap will be reported; with collections of sketches, all
pairs will be reported.
By default, manysearch
will display the contents of the CSV file in a
human-readable format. This can be disabled with -N/--no-pretty-print
when executing large searches.
The cluster
command conducts graph-based clustering via the sequence
similarity measures in pairwise
or multisearch
outputs. It is a
new command and we are exploring its utility.
cluster
takes the csv output of pairwise
or multisearch
input,
and outputs two CSVs:
-o
,--output
will contain the names of the clusters and theident
of each sequence included in the cluster (e.g.Component_1, name1;name2
)
cluster,nodes
Component_1,name1;name2;name3
Component_2,name4
--cluster-sizes
will contain information on cluster size, with a counts for the number of clusters of that size. For the two clusters above, the counts would look like this:
cluster_size,count
3,1
1,1
cluster
takes a --similarity_column
argument to specify which of
the similarity columns, with the following choices: containment
,
max_containment
, jaccard
, average_containment_ani
,
maximum_containment_ani
. All values should be input as fractions
(e.g. 0.9 for 90%)
The index
subcommand creates a RocksDB inverted index that can be
used as an efficient database for manysearch
(containment queries
into mixtures) and fastmultigather
(mixture decomposition against a
database of genomes).
RocksDB inverted indexes support fast, low-latency, and low-memory queries, in exchange for a time-intensive indexing step and extra disk space. They are also used for the branchwater.sourmash.bio real-time SRA metagenome query.
Note that RocksDB indexes do not support abundance estimation for containment queries.
To run index
, provide it with multiple sketches in sig, zip, or
pathlist format, and specify the desired output directory; we suggest
using the .rocksdb
extension for RocksDB databases, e.g. -o gtdb-rs214-k31.rocksdb
.
By default, index
will store a copy of the sketches
along with the inverted index. This will substantially increase the
disk space required for large databases. You can provide an optional
--no-internal-storage
to index
to store them externally, which
reduces the disk space needed for the index. Read below for technical
details!
index
can take any of the supported input types, but
unless you are using a zip file or a pathlist of JSON files, it may
need to load all the sketches into memory before indexing
them. Moreover, you can only use external storage with a zip file. We
are working on improving this; see
issue #415
for details. A warning will be printed to stderr in situations where
the sketches are being loaded into memory.
RocksDB indexes support containment queries (a la the
branchwater application),
as well as gather
-style mixture decomposition (see
Irber et al., 2022).
For this plugin, the manysearch
command supports a RocksDB index for
the database for containment queries, and fastmultigather
can use a
RocksDB index for the database of genomes.
RocksDB indexes contain references to the sketches used to construct
the index. If --internal-storage
is set (which is the default), a
copy of the sketches is stored within the RocksDB database directory;
if --no-internal-storage
is provided, then the references point to
the original source sketches used to construct the database, wherever
they reside on your disk.
The sketches are not used by manysearch
, but are used by
fastmultigather
: you'll get an error if you
run fastmultigather
against a RocksDB index where the sketches
cannot be loaded.
What this means is therefore a bit complicated, but boils down to the following two approaches:
- The safest thing to do is build a RocksDB index and use internal
storage (the default). This will consume more disk space but your
RocksDB database will always be usable for both
manysearch
andfastmultigather
, as well as the branchwater app. - If you want to avoid storing duplicate copies of your sketches,
then specify
--no-internal-storage
and provide a stable absolute path to the source sketches. This will again support bothmanysearch
andfastmultigather
, as well as the branchwater app. If the source sketches later become unavailable,fastmultigather
will stop working (althoughmanysearch
and the branchwater app should be fine).
You should (for the moment) avoid specifying relative paths to the
sketches when running index
. Follow
sourmash_branchwater_plugin#415
if better support for relative paths is of interest!
Note that RocksDB indexes are implemented in the core sourmash Rust library, and used in downstream software packages (this plugin, and the branchwater application code). The above documentation applies to sourmash core v0.15.0.
Unlike sourmash, which provides guarantees that command-line options and outputs will not change within minor versions, we make no guarantees of stability within the branchwater plugin. This is because the branchwater plugin is intended to move fast and occasionally break things.
Eventually we expect to provide all of the branchwater plugin's functionality within the sourmash package, at which time the sourmash guarantees will apply!
However, we do not expect command line options and output file formats to change quickly.
We will also endeavor to avoid changing column names in CSV output, although, we may change the order of column names on occasion. Please use the column headers (column names) to select specific columns.
Each command does things somewhat differently, with implications for
CPU and disk load; moreover, the performance will vary depending on
the database format. You can measure threading efficiency with
/usr/bin/time -v
on Linux systems, and disk load by number of
complaints received when running. Your best bet is to
just ask the team for suggestions,
but you can lightly skim the docs below and play with some small data
sets, too!
manysketch
loads one sequence file from disk per thread and sketches
it using all signature params simultaneously.
manysearch
loads all the queries at the beginning, and then loads
one database sketch from disk per thread. The
compute-per-database-sketch is dominated by I/O. So your number of
threads should be chosen with care for disk load. We typically limit
it to -c 32
for shared disks.
multisearch
loads all the queries and database sketches once, at the
beginning, and then uses multithreading to search across all matching
sequences. For large databases it is extremely efficient at using all
available cores. So 128 threads or more should work fine!
Zipfiles should work well.
pairwise
acts just like multisearch
, but only loads one file (and
then does all comparisons between all pairs within that file).
Like multisearch
and pairwise
, fastgather
loads everything at
the beginning, and then uses multithreading to search across all
matching sequences. For large databases it is extremely efficient at
using all available cores. So 128 threads or more should work fine! We
suggest using zipfiles for the database.
fastmultigather
loads the entire database once, and then loads one
query from disk per thread. The compute-per-query can be significant,
though, so multithreading efficiency here is less dependent on I/O and
the disk is less likely to be saturated with many threads. We suggest
limiting threads to between 32 and 64 to decrease shared disk load.
cluster
loads the entire file multithreaded, and then populates the
graph sequentially.
The command sourmash scripts index
makes an on-disk inverted index
for low memory fast search. Indexing takes a while, but then search
takes fewer resources.
Currently only fastmultigather
and manysearch
can use this kind of index.
fastmultigather
with this index produces a complete set of sourmash gather
columns.
We suggest using the extension .rocksdb
for these databases, as we
use RocksDB for the underlying database storage
mechanism.