address david comments
luizirber committed Sep 25, 2020
1 parent 7751a82 commit 757e32b
Showing 7 changed files with 84 additions and 40 deletions.
13 changes: 7 additions & 6 deletions thesis/01-scaled.Rmd
@@ -218,8 +218,8 @@ and doesn't include many required features for working with real biological data
but its smaller code base makes it a more readable and concise example of the method.
For _Mash Screen_ the ratio of hashes matched by total hashes is used instead of the _Containment Score_,
since the latter uses a $k$-mer survival process modeled as a Poisson process
first introduced in the _Mash distance_ [@ondov_mash:_2016] and later adapted as
the _Containment score_ [@ondov_mash_2019].
first introduced in [@fan_assembly_2015] and later used in the _Mash distance_ [@ondov_mash:_2016] and
_Containment score_ [@ondov_mash_2019] formulations.

Experiments use $k=\{21, 31, 51\}$
(except for Mash, which only supports $k \le 32$).
@@ -242,7 +242,8 @@ since `Mash` doesn't support $k>32$.
**B**: Excluding low coverage genomes identified in previous articles.

All methods are within 1\% of the exact containment on average (Figure \ref{fig:minhash1000} A),
with `CMash` consistently underestimating the containment for all $k$ values.
with `CMash` consistently underestimating the containment for large $k$ and overestimating for
small $k$.
`Mash Screen` with $n=10000$ has the smallest difference to ground truth for $k=\{21, 31\}$,
followed by `smol` with `scaled=1000` and `Mash Screen` with $n=1000$.

@@ -260,7 +261,7 @@ with all methods mostly within 1\% absolute difference to the ground truth.
### Scaled MinHash sketch sizes across GenBank domains

In order to compare sketch sizes across assembled genomes and where they are too
small to be properly estimate containment and similarity,
small to properly estimate containment and similarity,
514,209 assembled genomes (
5,234 Archaea,
464,485 Bacteria,
@@ -272,7 +273,7 @@ Figure \ref{fig:scaledGenBankSizesBoxen} show letter-value plots [@hofmann_lette
The mean sketch size ranges from 20 for viruses to 18,115 for protozoans,
indicating that _Scaled MinHash_ sketches at $scaled=2000$ are too small for accurate containment and similarity estimates
in most viral genomes,
but are large enough for archaea ($mean=767$) and larger organisms.
but are large enough for archaea (mean sketch size = $767$) and larger organisms.
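
As a rough check (assuming most $k$-mers in a genome are distinct),
a _Scaled MinHash_ sketch retains about a $1/s$ fraction of the distinct $k$-mers,
so the expected sketch size is approximately

$$E[\text{sketch size}] \approx \frac{\text{number of distinct } k\text{-mers}}{s}$$

and at $scaled=2000$ the observed means correspond to roughly $20 \times 2000 = 40$ kbp of
distinct $k$-mer content for viruses and $767 \times 2000 \approx 1.5$ Mbp for archaea,
in line with typical genome sizes in those domains.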

```{r scaledGenBankSizesBoxen, eval=TRUE, echo=FALSE, message=FALSE, error=FALSE, warning=FALSE, cache=TRUE, fig.cap='(ref:genbank-sizes-boxen)', out.width="100%", fig.show="hold", fig.align="center"}
knitr::include_graphics('../experiments/sizes/figures/sizes.pdf')
@@ -415,7 +416,7 @@ including taxonomic classification in metagenomes and large scale indexing and s
There are two compatible versions,
one in Python and another in Rust,
due to performance requirements when processing large datasets (like metagenomes).
Both versions of the Scaled MinHash implementations use each language standard library sets
Both versions of the Scaled MinHash implementations use each language's standard library sets
(`set` for Python, `HashSet` for Rust)
for storing hashes and efficient set operations (intersection and difference).
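
As an illustration of how these set operations yield the containment and similarity estimates used throughout this chapter,
the sketch below computes both directly from two sets of hashes
(hypothetical helper functions, not the actual sourmash code):

```python
def containment(query: set, match: set) -> float:
    # C(A, B) = |A ∩ B| / |A|, estimated on the retained hashes.
    return len(query & match) / len(query) if query else 0.0


def similarity(a: set, b: set) -> float:
    # Jaccard similarity: J(A, B) = |A ∩ B| / |A ∪ B|.
    union = a | b
    return len(a & b) / len(union) if union else 0.0
```
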
Experiments used the Rust version for calculating Scaled MinHash sketches,
33 changes: 17 additions & 16 deletions thesis/02-index.Rmd
@@ -38,8 +38,8 @@ to whit: large scaled works well with LCA (small DB, ~tolerable memory, load all
but low scaled may work (much) better with SBT.
-->

[@marchet_data_2019] classifies indexing strategies for querying large collections of
sequencing datasets into _$k$-mer aggregative_ methods and _color aggregative_ methods.
Indexing strategies for querying large collections of sequencing datasets
can be classified as _$k$-mer aggregative_ methods and _color aggregative_ methods [@marchet_data_2019].
_$k$-mer aggregative_ methods index the $k$-mer composition for each individual dataset,
and then build structures for retrieving the datasets where the query $k$-mers are present.
_Color aggregative_ methods index the $k$-mer composition for all datasets,
@@ -83,7 +83,7 @@ with network nodes containing a subtree of the original tree and only being
accessed if the search requires it.

For genomic contexts,
an hierarchical index is a _$k$-mer aggregative_ method,
a hierarchical index is a _$k$-mer aggregative_ method,
with datasets represented by the $k$-mer composition of the dataset and stored in a data structure that allows querying for $k$-mer presence.
The Sequence Bloom Tree [@solomon_fast_2016] adapts Bloofi for genomics and rephrases the search problem as _experiment discovery_:
given a query sequence $Q$ and a threshold $\theta$,
@@ -93,7 +93,7 @@ and queries are transcripts.

Further developments of the SBT approach focus on clustering similar datasets to prune the search
early [@sun_allsome_2017] and developing more efficient representations for the
internal nodes [@solomon_improved_2017] [@harris_improved_2018] to use less storage space and memory.
internal nodes [@solomon_improved_2017] and clustering [@harris_improved_2018] to use less storage space and memory.

<!--
example figure for SBT:
@@ -121,11 +121,11 @@ For efficiency,
$k$-mers are typically hashed and their integer representation (_hash_) is used
instead.

kraken [@wood_kraken:_2014] has a special case of this structure,
`Kraken` [@wood_kraken:_2014] has a special case of this structure,
using a taxonomic ID (taxon) for representing dataset identity.
Datasets share the same ID if they belong to the same taxon,
and if a hash is present in more than one dataset
kraken reduces the list of taxons to the lowest common ancestor (LCA),
Kraken reduces the list of taxons to the lowest common ancestor (LCA),
which lowers memory requirements for storing the index.
This LCA approach leads to decreased precision and sensitivity over time [@nasko_refseq_2018],
since new datasets are frequently added to reference databases and the chance of a k-mer being present in multiple datasets increases.
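
A minimal sketch of this LCA reduction,
assuming the taxonomy is available as a child-to-parent mapping
(names are illustrative, not Kraken's actual implementation):

```python
def lineage(taxon, parent):
    # Path from a taxon up to the root of the taxonomy.
    path = [taxon]
    while taxon in parent:
        taxon = parent[taxon]
        path.append(taxon)
    return path


def lowest_common_ancestor(taxons, parent):
    # Walk the lineages from the root downward while they still agree.
    paths = [list(reversed(lineage(t, parent))) for t in taxons]
    lca = None
    for level in zip(*paths):
        if len(set(level)) == 1:
            lca = level[0]
        else:
            break
    return lca
```
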
@@ -139,7 +139,7 @@ an alternative to Bloom Filters that also support counting and resizing.

## Specialized indices for Scaled MinHash sketches

sourmash [@brown_sourmash:_2016] is a software for large-scale sequence data comparisons based on MinHash sketches.
`sourmash` [@brown_sourmash:_2016] is a software for large-scale sequence data comparisons based on MinHash sketches.
Initially implementing operations for computing,
comparing and plotting distance matrices for _MinHash_ sketches,
version 2 [@pierce_large-scale_2019] introduces _Scaled MinHash_ sketches
Expand All @@ -153,7 +153,8 @@ The simplest index is the `LinearIndex`,
a list of signatures.
Search operations are executed sequentially,
and insertions append new signatures to the end of the list.
Internally sourmash uses LinearIndex as the default index for lists of
Internally,
`sourmash` uses LinearIndex as the default index for lists of
signatures provided in the command-line.
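
A minimal sketch of this behavior
(attribute and method names are illustrative; the real `LinearIndex` exposes a richer interface):

```python
class LinearIndex:
    def __init__(self):
        self.signatures = []

    def insert(self, signature):
        # Insertions simply append to the end of the list.
        self.signatures.append(signature)

    def search(self, query_hashes, threshold=0.1):
        # Sequential scan, keeping signatures whose containment of the
        # query meets the threshold.
        matches = []
        for sig in self.signatures:
            if not query_hashes:
                break
            containment = len(query_hashes & sig.hashes) / len(query_hashes)
            if containment >= threshold:
                matches.append((containment, sig))
        return matches
```
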

#### MinHash Bloom Tree
@@ -163,8 +164,8 @@ signatures provided in the command-line.
The _MinHash Bloom Tree_ (_MHBT_) is a variation of the _Sequence Bloom Tree_ (_SBT_)
that uses Scaled MinHash sketches as leaf nodes instead of Bloom Filters as in
the SBT.
The search operation in SBTs is defined as a breadth-first search starting in the root of the tree,
using a threshold of the original k-mers in the query to decide when to prune the search.
The search operation in SBTs is defined as a breadth-first search starting at the root of the tree,
using a threshold of the original $k$-mers in the query to decide when to prune the search.
MHBTs use a query Scaled MinHash sketch instead,
but keep the same search approach.
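
A simplified sketch of this breadth-first search with threshold pruning,
where `node.hashes` stands in for the Bloom Filter (SBT) or sketch (MHBT) stored at each node:

```python
from collections import deque


def search_tree(root, query_hashes, threshold):
    # Prune a subtree when it cannot contain at least `threshold` of the
    # query hashes; otherwise keep descending towards the leaves.
    min_overlap = threshold * len(query_hashes)
    matches = []
    queue = deque([root])
    while queue:
        node = queue.popleft()
        if len(query_hashes & node.hashes) < min_overlap:
            continue
        if node.is_leaf:
            matches.append(node.signature)
        else:
            queue.extend(node.children)
    return matches
```
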
The threshold of a query $Q$ approach introduced in [@solomon_fast_2016] is
@@ -211,7 +212,7 @@ with containment or similarity to the query calculated against the rebuilt signa

mash screen [@ondov_mash_2019] has a similar index,
but it is constructed on-the-fly using the distinct hashes in a sketch collection as keys,
and values are counters initialized set to zero.
and values are counters initially set to zero.
As the query is processed,
matching hashes have their counts incremented,
and after all hashes in the query are processed then all the sketches in the collection are
@@ -311,21 +312,21 @@ using appropriate queries for each domain.
All queries were selected from the relevant domain and queried against both MHBT ($scaled=1000$) and LCA ($scaled=10000$),
for $k=21$.

Table: (\#tab:search-runtime) Running time for similarity search
Table: (\#tab:search-runtime) Running time in seconds for similarity search
using LCA ($scaled=10000$) and MHBT ($scaled=1000$) indices.

| | Viral | Archaea | Protozoa | Fungi | Bacteria |
|:----------|-----------:|-----------:|-----------:|-------------:|--------------:|
| LCA | 1.06 | 1.42 | 5.40 | 26.92 | 231.26 |
| SBT | 1.32 | 3.77 | 43.51 | 244.77 | 3,185.88 |

Table: (\#tab:search-memory) Memory consumption for similarity search
Table: (\#tab:search-memory) Memory consumption in megabytes for similarity search
using LCA ($scaled=10000$) and MHBT ($scaled=1000$) indices.

| | Viral | Archaea | Protozoa | Fungi | Bacteria |
|:----------|--------:|--------:|---------:|----------:|--------------:|
| LCA | 223,484 | 240,088 | 797,816 | 3,274,388 | 20,925,984 |
| SBT | 162,628 | 125,280 | 332,080 | 1,655,788 | 2,289,904 |
| LCA | 223 | 240 | 798 | 3,274 | 20,926 |
| SBT | 163 | 125 | 332 | 1,656 | 2,290 |

Table \@ref(tab:search-runtime) shows running time for both indices.
For small indices (Viral and Archaea) the LCA running time is dominated by loading the index in memory,
@@ -376,7 +377,7 @@ especially for operations that need to summarize $k$-mer assignments or require

Due to these characteristics,
and if memory usage is not a concern,
then the LCA index is the most appropriate choice since it's faster.
then the LCA index is the most appropriate choice since it is faster.
The MHBT index is currently recommended for situations where memory is limited,
such as with smaller scaled values ($s\le2000$)
that increase the size of signatures,
24 changes: 20 additions & 4 deletions thesis/03-gather.Rmd
@@ -29,10 +29,11 @@ $k$-mer assignments matching multiple taxons from a reference database
with an option to reduce it further to the LCA [@kim_centrifuge_2016].

Once each sequence (from raw reads or assembled contigs) has a taxonomic assignment,
<!-- David comment: "Not all profiles classify sequences. See Focus, Quikr, Metalign, Metapalette, etc" -->
these methods resolve the final identity and abundance for each member of the community by summarizing the assignments to a specific taxonomic rank.
Taxonomic profiling is fundamentally limited by the availability of reference datasets to be used for assignments,
and reporting what percentage of the sample is unassigned is important to assess results,
especially in under characterized environments such as oceans and soil.
especially in undercharacterized environments such as oceans and soil.

### Decomposition of queries with gather

@@ -41,8 +42,12 @@ Methods summarizing taxonomic assignments from sequences in the query metagenome
starting from the $k$-mer composition of the query,
it iteratively finds a match in a collection of datasets with the largest _containment_ of the query (most elements in common),
and creates a new query by _removing elements_ in the match from the original query.
The process stop when the new query doesn't have any more matches in the collection,
The process stops when the new query doesn't have any more matches in the collection,
or a user-provided minimum detection threshold is reached.
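
A compact sketch of this iterative decomposition,
where `threshold` is a minimum number of overlapping hashes and `sig.hashes` stands in for a
Scaled MinHash sketch in the reference collection
(illustrative only, not the actual implementation):

```python
def gather(query_hashes, collection, threshold):
    query = set(query_hashes)
    results = []
    while query:
        # Best match: the dataset with the largest containment of the current query.
        best = max(collection, key=lambda sig: len(query & sig.hashes), default=None)
        if best is None:
            break
        common = query & best.hashes
        if len(common) < threshold:
            break  # below the minimum detection threshold
        results.append((best, len(common)))
        # Remove matched elements and search again with the remaining query.
        query -= common
    return results
```
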
<!-- David comment: "I'm surprised this works, since back in 2015 (Metapalette
days) I found removing elements like this caused the approach to fall apart when
closely-related organisms are in the metagenome.)
-->
This approach differs from previous methods because the co-occurrence of $k$-mers
in a match is considered a strong signal that they are coming from the same organism in the original sample,
and is used instead of the LCA-based methods to resolve ambiguities in the taxonomic assignment of a sequence (or its $k$-mers).
@@ -103,8 +108,6 @@ and based on that assignment the extra step associates a taxonomic ID
and a taxonomic lineage (a path from root to taxonomic ID) derived from a specific taxonomy.
After a lineage is available,
each taxonomic rank is summarized based on the abundances under it.
This information is used to build a profile in the CAMI profiling format,
but can also be adapted for other uses and tools.

<!-- TODO expand a bit -->

@@ -360,6 +363,10 @@ to existing methods.
Other containment estimation methods described in Chapter [1](#chp-scaled),
_CMash_ [@koslicki_improving_2019] and _mash screen_ [@ondov_mash_2019],
can also implement `gather`.
<!-- David comment: "CMash does kinda, but uses unique k-mers instead of
removing matches like gather does. CMash commit
https://github.com/dkoslicki/CMash/commit/de7bdd6fa
-->
Running a search requires access to the original dataset (_mash screen_) for the query,
or a Bloom Filter derived from the original dataset (_CMash_),
and when the collection of reference sketches is updated the Bloom Filter from _CMash_ can be reused,
@@ -446,6 +453,15 @@ and additional steps must be taken to disambiguate matches.
The availability of abundance counts for each element in the _Scaled MinHash_ is not well explored,
since the process of _removing elements_ from the query doesn't account for them
(the element is removed even if the count is much higher than the count in the match).
<!-- David comment: could use a compressive sensing approach here:
$ min \norm{x}^2_1 + \lambda \norm{Ax - y}^2_2, x \ge 0$
Y_i = count of hash i in sample
A_ij = count of hash i in genome j
convert to least squares and use Lawson and Hanson for blistering speed!
-->
Both the multiple-match and the abundance-count issues can benefit from
existing solutions adopted by other methods,
like the _species score_ (for disambiguation) and _Expectation-Maximization_ (for abundance analysis)
14 changes: 7 additions & 7 deletions thesis/04-distributed.Rmd
@@ -16,7 +16,7 @@ since a signature retains a much smaller subset of the original data
and once a chunk of the data is processed it is not necessary to keep it or refer to it again.
The calculation process is frequently I/O bound on reading data,
unless very fast access to the data is available,
when it becomes CPU-bound on hashing algorithm.
when it becomes CPU-bound on the hashing algorithm.
sourmash signatures are also typically small,
with inputs typically hundreds of times larger than the final signature.
This creates opportunities for approaches where distributed workers do all the expensive data access,
@@ -96,7 +96,7 @@ depending on what database is being calculated.
For IMG the data was available in a shared filesystem,
so `sourmash compute` could access it directly.

For the SRA the worker uses `fastq-dump` (from the `sra toolkit`) to download the data and stream it through a UNIX pipe into a `sourmash compute` process.
For the SRA, the worker uses `fastq-dump` (from the `sra toolkit`) to download the data and stream it through a UNIX pipe into a `sourmash compute` process.
Since the data is being streamed,
it is not being stored locally by the worker,
so if a signature needs to be recalculated the data needs to be downloaded again.
@@ -160,7 +160,7 @@ but not as appropriate for a system that is frequently updated.

`wort` was created to fix some of the issues with the initial `soursigs` prototype
and allow more computational systems to participate in the process,
building an heterogeneous system that doesn't depend on similar resources being available for each worker.
building a heterogeneous system that doesn't depend on similar resources being available for each worker.
The two major changes were:

- encapsulate workers in a Docker container [@boettiger_introduction_2015],
@@ -234,7 +234,7 @@ It is worth noting that signatures were calculated with a $\mathbf{scaled}=1000$
but $332/2.4=138$,
more than expected if considering only final file size.
Although each signature is using this scaled value,
there are 3 Scaled MinHash sketches for each dataset (for $k={21,31,51}$),
there are 3 Scaled MinHash sketches for each dataset (for $k=\{21,31,51\}$),
and each sketch is also tracking abundance for each hash,
which doubles the storage requirements.

@@ -268,14 +268,14 @@ knitr::include_graphics('figure/wortMetagSizes.png')

448 thousand datasets from the metagenomic subset were processed from 237 TB of original sequencing data,
totalling 2.1 TB of signatures.
Building a sourmash index (either revindex or SBT) with this amount of data is
Building a sourmash index (either LCA or MHBT) with this amount of data is
prohibitive at the moment,
since both indices load all the data in memory before writing the indices to storage.
This is an implementation detail,
and can be addressed in future versions of sourmash.

In the same vein,
executing sourmash operations over that many signatures is not an use case that was previously considered,
executing sourmash operations over that many signatures is not a use case that was previously considered,
so even operations over a list of signatures (and which don't strictly require indices)
like `sourmash search` end up requiring too much memory when using hundreds of
thousands of metagenome signatures.
@@ -318,7 +318,7 @@ If there is access to additional machines (in a cluster, for example)
it is also possible to parallelize the search even further by splitting the metagenomes into groups,
and running each group on a separate machine.

The search found 1,407 of the original 2,631 MAGs above 50% containment, in 2,938 metagenomes
The search with $k=31$ found 1,407 of the original 2,631 MAGs above 50% containment, in 2,938 metagenomes
(after removing the Tara Oceans datasets from the results, since the MAGs were derived from them).
The 50% threshold was chosen to account for sequencing errors and biological variability,
since Scaled MinHash sketches use exact k-mer matching when calculating containment.
4 changes: 2 additions & 2 deletions thesis/05-decentralized.Rmd
@@ -200,7 +200,7 @@ Signatures can also change if other parameters for calculating their sketches ar
but having a default set of "good enough" parameters like the ones defined in `wort`
allows general queries that can later be refined once a candidate subset of the database is found.

```{r sbtInsert, eval=TRUE, echo=FALSE, message=FALSE, error=FALSE, warning=FALSE, cache=TRUE, out.width="60%", fig.cap="Inserting a dataset in an MHBT"}
```{r sbtInsert, eval=TRUE, echo=FALSE, message=FALSE, error=FALSE, warning=FALSE, cache=TRUE, out.width="70%", fig.align="center", fig.cap="Inserting a dataset in an MHBT"}
knitr::include_graphics('figure/sbtmh_insert.png')
```
Insertion of new signatures into an MHBT triggers changes in the path from the new leaf
@@ -314,7 +314,7 @@ and to measure the overhead that IPFS imposes when used in conditions more simil
when all the data is available.

Table: (\#tab:sbt-ipfs) Performance of MHBT operations with IPFS storage.
Units in seconds, with fold-difference to the baseline in parenthesis.
Units in seconds, with fold-difference to the baseline in parentheses.

| Experiment | takver | datasilo | rosewater |
|:---------------|-------:|-----------:|------------:|