address david comments
luizirber committed Sep 25, 2020
1 parent 7751a82 commit 757e32b
Showing 7 changed files with 84 additions and 40 deletions.
13 changes: 7 additions & 6 deletions thesis/01-scaled.Rmd
@@ -218,8 +218,8 @@ and doesn't include many required features for working with real biological data
but its smaller code base makes it a more readable and concise example of the method.
For _Mash Screen_ the ratio of hashes matched by total hashes is used instead of the _Containment Score_,
since the latter uses a $k$-mer survival process modeled as a Poisson process
first introduced in the _Mash distance_ [@ondov_mash:_2016] and later adapted as
the _Containment score_ [@ondov_mash_2019].
first introduced in [@fan_assembly_2015] and later used in the _Mash distance_ [@ondov_mash:_2016] and
_Containment score_ [@ondov_mash_2019] formulations.

Experiments use $k=\{21, 31, 51\}$
(except for Mash, which only supports $k \le 32$).
@@ -242,7 +242,8 @@ since `Mash` doesn't support $k>32$.
**B**: Excluding low coverage genomes identified in previous articles.

All methods are within 1\% of the exact containment on average (Figure \ref{fig:minhash1000} A),
with `CMash` consistently underestimating the containment for all $k$ values.
with `CMash` consistently underestimating the containment for large $k$ and overestimating for
small $k$.
`Mash Screen` with $n=10000$ has the smallest difference to ground truth for $k=\{21, 31\}$,
followed by `smol` with `scaled=1000` and `Mash Screen` with $n=1000$.

@@ -260,7 +261,7 @@ with all methods mostly within 1\% absolute difference to the ground truth.
### Scaled MinHash sketch sizes across GenBank domains

In order to compare sketch sizes across assembled genomes and where they are too
small to be properly estimate containment and similarity,
small to properly estimate containment and similarity,
514,209 assembled genomes (
5,234 Archaea,
464,485 Bacteria,
@@ -272,7 +273,7 @@ Figure \ref{fig:scaledGenBankSizesBoxen} show letter-value plots [@hofmann_lette
The mean sketch size ranges from 20 for viruses to 18,115 for protozoans,
indicating that _Scaled MinHash_ sketches at $scaled=2000$ are too small for accurate containment and similarity estimates
in most viral genomes,
but are large enough for archaea ($mean=767$) and larger organisms.
but are large enough for archaea (mean sketch size = $767$) and larger organisms.
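
As a rough check (assuming most $k$-mers in a genome are distinct),
a _Scaled MinHash_ sketch retains about a $1/s$ fraction of the distinct $k$-mers,
so the expected sketch size is approximately

$$E[\text{sketch size}] \approx \frac{\text{number of distinct } k\text{-mers}}{s}$$

and at $scaled=2000$ the observed means correspond to roughly $20 \times 2000 = 40$ kbp of
distinct $k$-mer content for viruses and $767 \times 2000 \approx 1.5$ Mbp for archaea,
in line with typical genome sizes in those domains.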

```{r scaledGenBankSizesBoxen, eval=TRUE, echo=FALSE, message=FALSE, error=FALSE, warning=FALSE, cache=TRUE, fig.cap='(ref:genbank-sizes-boxen)', out.width="100%", fig.show="hold", fig.align="center"}
knitr::include_graphics('../experiments/sizes/figures/sizes.pdf')
@@ -415,7 +416,7 @@ including taxonomic classification in metagenomes and large scale indexing and s
There are two compatible versions,
one in Python and another in Rust,
due to performance requirements when processing large datasets (like metagenomes).
Both versions of the Scaled MinHash implementations use each language standard library sets
Both versions of the Scaled MinHash implementations use each language's standard library sets
(`set` for Python, `HashSet` for Rust)
for storing hashes and efficient set operations (intersection and difference).
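
As an illustration of how these set operations yield the containment and similarity estimates used throughout this chapter,
the sketch below computes both directly from two sets of hashes
(hypothetical helper functions, not the actual sourmash code):

```python
def containment(query: set, match: set) -> float:
    # C(A, B) = |A ∩ B| / |A|, estimated on the retained hashes.
    return len(query & match) / len(query) if query else 0.0


def similarity(a: set, b: set) -> float:
    # Jaccard similarity: J(A, B) = |A ∩ B| / |A ∪ B|.
    union = a | b
    return len(a & b) / len(union) if union else 0.0
```
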
Experiments used the Rust version for calculating Scaled MinHash sketches,
33 changes: 17 additions & 16 deletions thesis/02-index.Rmd
@@ -38,8 +38,8 @@ to whit: large scaled works well with LCA (small DB, ~tolerable memory, load all
but low scaled may work (much) better with SBT.
-->

[@marchet_data_2019] classifies indexing strategies for querying large collections of
sequencing datasets into _$k$-mer aggregative_ methods and _color aggregative_ methods.
Indexing strategies for querying large collections of sequencing datasets
can be classified as _$k$-mer aggregative_ methods and _color aggregative_ methods [@marchet_data_2019].
_$k$-mer aggregative_ methods index the $k$-mer composition for each individual dataset,
and then build structures for retrieving the datasets where the query $k$-mers are present.
_Color aggregative_ methods index the $k$-mer composition for all datasets,
@@ -83,7 +83,7 @@ with network nodes containing a subtree of the original tree and only being
accessed if the search requires it.

For genomic contexts,
an hierarchical index is a _$k$-mer aggregative_ method,
a hierarchical index is a _$k$-mer aggregative_ method,
with datasets represented by the $k$-mer composition of the dataset and stored in a data structure that allows querying for $k$-mer presence.
The Sequence Bloom Tree [@solomon_fast_2016] adapts Bloofi for genomics and rephrases the search problem as _experiment discovery_:
given a query sequence $Q$ and a threshold $\theta$,
@@ -93,7 +93,7 @@ and queries are transcripts.

Further developments of the SBT approach focus on clustering similar datasets to prune the search
early [@sun_allsome_2017] and developing more efficient representations for the
internal nodes [@solomon_improved_2017] [@harris_improved_2018] to use less storage space and memory.
internal nodes [@solomon_improved_2017] and clustering [@harris_improved_2018] to use less storage space and memory.

<!--
example figure for SBT:
@@ -121,11 +121,11 @@ For efficiency,
$k$-mers are typically hashed and their integer representation (_hash_) is used
instead.

kraken [@wood_kraken:_2014] has a special case of this structure,
`Kraken` [@wood_kraken:_2014] has a special case of this structure,
using a taxonomic ID (taxon) for representing dataset identity.
Datasets share the same ID if they belong to the same taxon,
and if a hash is present in more than one dataset
kraken reduces the list of taxons to the lowest common ancestor (LCA),
Kraken reduces the list of taxons to the lowest common ancestor (LCA),
which lowers memory requirements for storing the index.
This LCA approach leads to decreased precision and sensitivity over time [@nasko_refseq_2018],
since new datasets are frequently added to reference databases and the chance of a k-mer being present in multiple datasets increases.
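
A minimal sketch of this LCA reduction,
assuming the taxonomy is available as a child-to-parent mapping
(names are illustrative, not Kraken's actual implementation):

```python
def lineage(taxon, parent):
    # Path from a taxon up to the root of the taxonomy.
    path = [taxon]
    while taxon in parent:
        taxon = parent[taxon]
        path.append(taxon)
    return path


def lowest_common_ancestor(taxons, parent):
    # Walk the lineages from the root downward while they still agree.
    paths = [list(reversed(lineage(t, parent))) for t in taxons]
    lca = None
    for level in zip(*paths):
        if len(set(level)) == 1:
            lca = level[0]
        else:
            break
    return lca
```
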
@@ -139,7 +139,7 @@ an alternative to Bloom Filters that also support counting and resizing.

## Specialized indices for Scaled MinHash sketches

sourmash [@brown_sourmash:_2016] is a software for large-scale sequence data comparisons based on MinHash sketches.
`sourmash` [@brown_sourmash:_2016] is a software for large-scale sequence data comparisons based on MinHash sketches.
Initially implementing operations for computing,
comparing and plotting distance matrices for _MinHash_ sketches,
version 2 [@pierce_large-scale_2019] introduces _Scaled MinHash_ sketches
Expand All @@ -153,7 +153,8 @@ The simplest index is the `LinearIndex`,
a list of signatures.
Search operations are executed sequentially,
and insertions append new signatures to the end of the list.
Internally sourmash uses LinearIndex as the default index for lists of
Internally,
`sourmash` uses LinearIndex as the default index for lists of
signatures provided in the command-line.
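
A minimal sketch of this behavior
(attribute and method names are illustrative; the real `LinearIndex` exposes a richer interface):

```python
class LinearIndex:
    def __init__(self):
        self.signatures = []

    def insert(self, signature):
        # Insertions simply append to the end of the list.
        self.signatures.append(signature)

    def search(self, query_hashes, threshold=0.1):
        # Sequential scan, keeping signatures whose containment of the
        # query meets the threshold.
        matches = []
        for sig in self.signatures:
            if not query_hashes:
                break
            containment = len(query_hashes & sig.hashes) / len(query_hashes)
            if containment >= threshold:
                matches.append((containment, sig))
        return matches
```
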

#### MinHash Bloom Tree
@@ -163,8 +164,8 @@ signatures provided in the command-line.
The _MinHash Bloom Tree_ (_MHBT_) is a variation of the _Sequence Bloom Tree_ (_SBT_)
that uses Scaled MinHash sketches as leaf nodes instead of Bloom Filters as in
the SBT.
The search operation in SBTs is defined as a breadth-first search starting in the root of the tree,
using a threshold of the original k-mers in the query to decide when to prune the search.
The search operation in SBTs is defined as a breadth-first search starting at the root of the tree,
using a threshold of the original $k$-mers in the query to decide when to prune the search.
MHBTs use a query Scaled MinHash sketch instead,
but keep the same search approach.
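
A simplified sketch of this breadth-first search with threshold pruning,
where `node.hashes` stands in for the Bloom Filter (SBT) or sketch (MHBT) stored at each node:

```python
from collections import deque


def search_tree(root, query_hashes, threshold):
    # Prune a subtree when it cannot contain at least `threshold` of the
    # query hashes; otherwise keep descending towards the leaves.
    min_overlap = threshold * len(query_hashes)
    matches = []
    queue = deque([root])
    while queue:
        node = queue.popleft()
        if len(query_hashes & node.hashes) < min_overlap:
            continue
        if node.is_leaf:
            matches.append(node.signature)
        else:
            queue.extend(node.children)
    return matches
```
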
The threshold of a query $Q$ approach introduced in [@solomon_fast_2016] is
@@ -211,7 +212,7 @@ with containment or similarity to the query calculated against the rebuilt signa

mash screen [@ondov_mash_2019] has a similar index,
but it is constructed on-the-fly using the distinct hashes in a sketch collection as keys,
and values are counters initialized set to zero.
and values are counters initially set to zero.
As the query is processed,
matching hashes have their counts incremented,
and after all hashes in the query are processed then all the sketches in the collection are
@@ -311,21 +312,21 @@ using appropriate queries for each domain.
All queries were selected from the relevant domain and queried against both MHBT ($scaled=1000$) and LCA ($scaled=10000$),
for $k=21$.

Table: (\#tab:search-runtime) Running time for similarity search
Table: (\#tab:search-runtime) Running time in seconds for similarity search
using LCA ($scaled=10000$) and MHBT ($scaled=1000$) indices.

| | Viral | Archaea | Protozoa | Fungi | Bacteria |
|:----------|-----------:|-----------:|-----------:|-------------:|--------------:|
| LCA | 1.06 | 1.42 | 5.40 | 26.92 | 231.26 |
| SBT | 1.32 | 3.77 | 43.51 | 244.77 | 3,185.88 |

Table: (\#tab:search-memory) Memory consumption for similarity search
Table: (\#tab:search-memory) Memory consumption in megabytes for similarity search
using LCA ($scaled=10000$) and MHBT ($scaled=1000$) indices.

| | Viral | Archaea | Protozoa | Fungi | Bacteria |
|:----------|--------:|--------:|---------:|----------:|--------------:|
| LCA | 223,484 | 240,088 | 797,816 | 3,274,388 | 20,925,984 |
| SBT | 162,628 | 125,280 | 332,080 | 1,655,788 | 2,289,904 |
| LCA | 223 | 240 | 798 | 3,274 | 20,926 |
| SBT | 163 | 125 | 332 | 1,656 | 2,290 |

Table \@ref(tab:search-runtime) shows running time for both indices.
For small indices (Viral and Archaea) the LCA running time is dominated by loading the index in memory,
@@ -376,7 +377,7 @@ especially for operations that need to summarize $k$-mer assignments or require

Due to these characteristics,
and if memory usage is not a concern,
then the LCA index is the most appropriate choice since it's faster.
then the LCA index is the most appropriate choice since it is faster.
The MHBT index is currently recommended for situations where memory is limited,
such as with smaller scaled values ($s\le2000$)
that increase the size of signatures,
24 changes: 20 additions & 4 deletions thesis/03-gather.Rmd
@@ -29,10 +29,11 @@ $k$-mer assignments matching multiple taxons from a reference database
with an option to reduce it further to the LCA [@kim_centrifuge_2016].

Once each sequence (from raw reads or assembled contigs) has a taxonomic assignment,
<!-- David comment: "Not all profiles classify sequences. See Focus, Quikr, Metalign, Metapalette, etc" -->
these methods resolve the final identity and abundance for each member of the community by summarizing the assignments to a specific taxonomic rank.
Taxonomic profiling is fundamentally limited by the availability of reference datasets to be used for assignments,
and reporting what percentage of the sample is unassigned is important to assess results,
especially in under characterized environments such as oceans and soil.
especially in undercharacterized environments such as oceans and soil.

### Decomposition of queries with gather

@@ -41,8 +42,12 @@ Methods summarizing taxonomic assignments from sequences in the query metagenome
starting from the $k$-mer composition of the query,
it iteratively finds a match in a collection of datasets with the largest _containment_ of the query (most elements in common),
and creates a new query by _removing elements_ in the match from the original query.
The process stop when the new query doesn't have any more matches in the collection,
The process stops when the new query doesn't have any more matches in the collection,
or a user-provided minimum detection threshold is reached.
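
A compact sketch of this iterative decomposition,
where `threshold` is a minimum number of overlapping hashes and `sig.hashes` stands in for a
Scaled MinHash sketch in the reference collection
(illustrative only, not the actual implementation):

```python
def gather(query_hashes, collection, threshold):
    query = set(query_hashes)
    results = []
    while query:
        # Best match: the dataset with the largest containment of the current query.
        best = max(collection, key=lambda sig: len(query & sig.hashes), default=None)
        if best is None:
            break
        common = query & best.hashes
        if len(common) < threshold:
            break  # below the minimum detection threshold
        results.append((best, len(common)))
        # Remove matched elements and search again with the remaining query.
        query -= common
    return results
```
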
<!-- David comment: "I'm surprised this works, since back in 2015 (Metapalette
days) I found removing elements like this caused the approach to fall apart when
closely-related organisms are in the metagenome.)
-->
This approach differs from previous methods because the co-occurrence of $k$-mers
in a match is considered a strong signal that they are coming from the same organism in the original sample,
and is used instead of the LCA-based methods to resolve ambiguities in the taxonomic assignment of a sequence (or its $k$-mers).
@@ -103,8 +108,6 @@ and based on that assignment the extra step associates a taxonomic ID
and a taxonomic lineage (a path from root to taxonomic ID) derived from a specific taxonomy.
After a lineage is available,
each taxonomic rank is summarized based on the abundances under it.
This information is used to build a profile in the CAMI profiling format,
but can also be adapted for other uses and tools.

<!-- TODO expand a bit -->

@@ -360,6 +363,10 @@ to existing methods.
Other containment estimation methods described in Chapter [1](#chp-scaled),
_CMash_ [@koslicki_improving_2019] and _mash screen_ [@ondov_mash_2019],
can also implement `gather`.
<!-- David comment: "CMash does kinda, but uses unique k-mers instead of
removing matches like gather does. CMash commit
https://github.com/dkoslicki/CMash/commit/de7bdd6fa
-->
Running a search requires access to the original dataset (_mash screen_) for the query,
or a Bloom Filter derived from the original dataset (_CMash_),
and when the collection of reference sketches is updated the Bloom Filter from _CMash_ can be reused,
@@ -446,6 +453,15 @@ and additional steps must be taken to disambiguate matches.
The availability of abundance counts for each element in the _Scaled MinHash_ is not well explored,
since the process of _removing elements_ from the query doesn't account for them
(the element is removed even if the count is much higher than the count in the match).
<!-- David comment: could use a compressive sensing approach here:
$ min \norm{x}^2_1 + \lambda \norm{Ax - y}^2_2, x \ge 0$
Y_i = count of hash i in sample
A_ij = count of hash i in genome j
convert to least squares and use Lawson and Hanson for blistering speed!
-->
Both the multiple-match and the abundance-count issues can benefit from
existing solutions adopted by other methods,
like the _species score_ (for disambiguation) and _Expectation-Maximization_ (for abundance analysis)
14 changes: 7 additions & 7 deletions thesis/04-distributed.Rmd
@@ -16,7 +16,7 @@ since a signature retains a much smaller subset of the original data
and once a chunk of the data is processed it is not necessary to keep it or refer to it again.
The calculation process is frequently I/O bound on reading data,
unless very fast access to the data is available,
when it becomes CPU-bound on hashing algorithm.
when it becomes CPU-bound on the hashing algorithm.
sourmash signatures are also typically small,
with inputs typically hundreds of times larger than the final signature.
This creates opportunities for approaches where distributed workers do all the expensive data access,
@@ -96,7 +96,7 @@ depending on what database is being calculated.
For IMG the data was available in a shared filesystem,
so `sourmash compute` could access it directly.

For the SRA the worker uses `fastq-dump` (from the `sra toolkit`) to download the data and stream it through a UNIX pipe into a `sourmash compute` process.
For the SRA, the worker uses `fastq-dump` (from the `sra toolkit`) to download the data and stream it through a UNIX pipe into a `sourmash compute` process.
Since the data is being streamed,
it is not being stored locally by the worker,
so if a signature needs to be recalculated the data needs to be downloaded again.
@@ -160,7 +160,7 @@ but not as appropriate for a system that is frequently updated.

`wort` was created to fix some of the issues with the initial `soursigs` prototype
and allow more computational systems to participate in the process,
building an heterogeneous system that doesn't depend on similar resources being available for each worker.
building a heterogeneous system that doesn't depend on similar resources being available for each worker.
The two major changes were:

- encapsulate workers in a Docker container [@boettiger_introduction_2015],
@@ -234,7 +234,7 @@ It is worth noting that signatures were calculated with a $\mathbf{scaled}=1000$
but $332/2.4=138$,
more than expected if considering only final file size.
Although each signature is using this scaled value,
there are 3 Scaled MinHash sketches for each dataset (for $k={21,31,51}$),
there are 3 Scaled MinHash sketches for each dataset (for $k=\{21,31,51\}$),
and each sketch is also tracking abundance for each hash,
which doubles the storage requirements.

@@ -268,14 +268,14 @@ knitr::include_graphics('figure/wortMetagSizes.png')

448 thousand datasets from the metagenomic subset were processed from 237 TB of original sequencing data,
totalling 2.1 TB of signatures.
Building a sourmash index (either revindex or SBT) with this amount of data is
Building a sourmash index (either LCA or MHBT) with this amount of data is
prohibitive at the moment,
since both indices load all the data in memory before writing the indices to storage.
This is an implementation detail,
and can be addressed in future versions of sourmash.

In the same vein,
executing sourmash operations over that many signatures is not an use case that was previously considered,
executing sourmash operations over that many signatures is not a use case that was previously considered,
so even operations over a list of signatures (and which don't strictly require indices)
like `sourmash search` end up requiring too much memory when using hundreds of
thousands of metagenome signatures.
@@ -318,7 +318,7 @@ If there is access to additional machines (in a cluster, for example)
it is also possible to parallelize the search even further by splitting the metagenomes into groups,
and running each group on a separate machine.

The search found 1,407 of the original 2,631 MAGs above 50% containment, in 2,938 metagenomes
The search with $k=31$ found 1,407 of the original 2,631 MAGs above 50% containment, in 2,938 metagenomes
(after removing the Tara Oceans datasets from the results, since the MAGs were derived from them).
The 50% threshold was chosen to account for sequencing errors and biological variability,
since Scaled MinHash sketches use exact k-mer matching when calculating containment.
4 changes: 2 additions & 2 deletions thesis/05-decentralized.Rmd
@@ -200,7 +200,7 @@ Signatures can also change if other parameters for calculating their sketches ar
but having a default set of "good enough" parameters like the ones defined in `wort`
allows general queries that can later be refined once a candidate subset of the database is found.

```{r sbtInsert, eval=TRUE, echo=FALSE, message=FALSE, error=FALSE, warning=FALSE, cache=TRUE, out.width="60%", fig.cap="Inserting a dataset in an MHBT"}
```{r sbtInsert, eval=TRUE, echo=FALSE, message=FALSE, error=FALSE, warning=FALSE, cache=TRUE, out.width="70%", fig.align="center", fig.cap="Inserting a dataset in an MHBT"}
knitr::include_graphics('figure/sbtmh_insert.png')
```
Insertion of new signatures into an MHBT triggers changes in the path from the new leaf
@@ -314,7 +314,7 @@ and to measure the overhead that IPFS imposes when used in conditions more simil
when all the data is available.

Table: (\#tab:sbt-ipfs) Performance of MHBT operations with IPFS storage.
Units in seconds, with fold-difference to the baseline in parenthesis.
Units in seconds, with fold-difference to the baseline in parentheses.

| Experiment | takver | datasilo | rosewater |
|:---------------|-------:|-----------:|------------:|