This repository has been archived by the owner on Jul 18, 2023. It is now read-only.
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
1 parent 44f706d, commit e0cf175
Showing 1 changed file with 154 additions and 0 deletions.
@@ -0,0 +1,154 @@
# AwesomeAssembler

CI Status:

master: ![image](https://travis-ci.com/fxia22/HINGE.svg?token=i41xfGcHb72GYFyZnvtg&branch=master)

dev: ![image](https://travis-ci.com/fxia22/HINGE.svg?token=i41xfGcHb72GYFyZnvtg&branch=dev)

## Introduction

AwesomeAssembler is an experimental long-read assembler based on an idea called _hinging_. It is currently at the research prototype stage.

## Pipeline Overview

AwesomeAssembler is an OLC (Overlap-Layout-Consensus) assembler. The idea of the pipeline is shown below.

![image](High_level_overview.png)

At a high level, the algorithm can be thought of as a variation of the classical greedy algorithm.
The main difference from the greedy algorithm is that, rather than each read having a single successor
and a single predecessor, we allow a small subset of reads to have a higher number of successors/predecessors.
This subset is identified by a process called _hinging_. This helps us recover the graph structure
directly during assembly.
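
The greedy-with-hinging idea described above can be illustrated with a small Python sketch. This is not the code in `layout/hinging.cpp`: the overlap representation (a dict of candidate successors with overlap lengths), the function name `greedy_with_hinges`, and the rule that a hinged read keeps all of its candidate successors are assumptions made purely for illustration.

```python
def greedy_with_hinges(overlaps, hinged):
    """Pick successor edges for each read.

    overlaps: dict read -> list of (successor, overlap_length) candidates.
    hinged:   set of reads allowed to keep more than one successor.
    Returns a dict read -> list of chosen successors.
    """
    graph = {}
    for read, candidates in overlaps.items():
        if not candidates:
            graph[read] = []
        elif read in hinged:
            # Hinged reads keep every candidate, preserving the branch.
            graph[read] = sorted(s for s, _ in candidates)
        else:
            # Classical greedy choice: the single longest overlap.
            best, _ = max(candidates, key=lambda c: c[1])
            graph[read] = [best]
    return graph


# Hypothetical toy input: r2 is marked as hinged, so it keeps both successors.
overlaps = {
    "r1": [("r2", 5000), ("r3", 3000)],
    "r2": [("r4", 4000), ("r5", 3900)],
    "r3": [],
}
layout = greedy_with_hinges(overlaps, hinged={"r2"})
```

In this toy input, `r1` gets only its best successor, while the hinged read `r2` retains a branch, which is how the graph structure can survive a greedy layout.
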

Another significant difference from the HGAP and Falcon pipelines is that there is no pre-assembly or read correction step.

## Algorithm Details

### Reads filtering
Read filtering removes short reads and reads that have a long chimeric segment in the middle.
Reads that may have a higher number of predecessors/successors are also identified at this stage.
This is implemented in `filter/filter.cpp`.

### Layout
The layout is implemented in `layout/hinging.cpp`. It is done by a variant of the greedy algorithm.

The graph output by the layout stage is post-processed by running `scripts/pruning_and_clipping.py`.
This removes dead ends and Z-structures from the graph, enabling easy condensation.
One output is a graphml file, which is the graph representation of the backbone.
It can be analyzed and visualized.
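
As a rough illustration of the dead-end removal mentioned above, here is a minimal Python sketch. It is not taken from `scripts/pruning_and_clipping.py`: the definition of a dead end (a short linear chain ending in a node without successors and hanging off a branching node), the `max_tip_len` parameter, and the dict-of-sets graph representation are all assumptions. Z-structures are not handled.

```python
def remove_dead_ends(graph, max_tip_len=2):
    """Remove short dead-end branches from a directed graph.

    graph: dict node -> set of successors (every node appears as a key).
    A dead end is assumed to be a chain of at most max_tip_len nodes that
    ends in a node without successors and hangs off a branching node.
    Returns a pruned copy; the input is left unchanged.
    """
    preds = {n: set() for n in graph}
    for n, succs in graph.items():
        for s in succs:
            preds[s].add(n)

    to_remove = set()
    for tip in (n for n in graph if not graph[n]):
        chain, node = [tip], tip
        # Walk backwards while the chain stays short and linear.
        while len(chain) <= max_tip_len and len(preds[node]) == 1:
            parent = next(iter(preds[node]))
            if len(graph[parent]) > 1:
                # The parent branches, so the chain is a dead end.
                to_remove.update(chain)
                break
            chain.append(parent)
            node = parent

    return {n: {s for s in succs if s not in to_remove}
            for n, succs in graph.items() if n not in to_remove}


# Hypothetical toy graph: a long path a..f with a one-node tip x off b.
toy = {"a": {"b"}, "b": {"c", "x"}, "c": {"d"}, "d": {"e"},
       "e": {"f"}, "f": set(), "x": set()}
pruned = remove_dead_ends(toy)
```

The end of the main path (`f`) survives because walking back from it exceeds `max_tip_len` before reaching a branching node, while the short tip `x` is dropped.
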

## Parameters

In the pipeline described above, most programs take not only the input and output files as arguments but also require a configuration file in ini format. This file contains the parameters for each step of the pipeline; their usage and effects are explained below:

### [filter]
- length_threshold = 6500; // Length threshold for reads to be considered in the backbone
- quality_threshold = 0.23; // Quality threshold for edges to be considered in the backbone
- n_iter = 2; // Number of filtering iterations. Several iterations are needed because filtering reads removes some edges, and filtering edges removes some reads (when a read's last edge is filtered). Typically 2-3 iterations are enough.
- aln_threshold = 2500; // Length of alignment for edges to be considered in the backbone
- min_cov = 5; // Minimal coverage for a segment to be considered neither chimeric nor adaptor
- cut_off = 200; // A parameter for identifying a long chimeric segment in the middle of a read
- theta = 300; // Tolerance for the overhang length when looking for a right extension
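
The interaction between read filtering and edge filtering that motivates `n_iter` can be sketched as follows. This is an illustrative simplification, not the logic of `filter/filter.cpp`: the data structures and the function name `iterative_filter` are assumptions, and the quality, coverage and chimer checks are left out.

```python
def iterative_filter(read_lengths, edges, length_threshold=6500,
                     aln_threshold=2500, n_iter=2):
    """Alternate read and edge filtering for n_iter rounds.

    read_lengths: dict read -> length.
    edges:        list of (read_a, read_b, alignment_length).
    Returns (kept_reads, kept_edges).
    """
    reads = {r for r, n in read_lengths.items() if n >= length_threshold}
    kept = [e for e in edges if e[2] >= aln_threshold]
    for _ in range(n_iter):
        # Filtering reads removes the edges that touch them.
        kept = [e for e in kept if e[0] in reads and e[1] in reads]
        # Filtering edges removes reads whose last edge is gone.
        connected = {r for a, b, _ in kept for r in (a, b)}
        reads &= connected
    return reads, kept


# Hypothetical toy data: d is too short, and the b-c alignment is too short.
read_lengths = {"a": 9000, "b": 8000, "c": 7000, "d": 3000}
edges = [("a", "b", 4000), ("b", "c", 2000), ("c", "d", 5000)]
reads, kept = iterative_filter(read_lengths, edges)
```

Here read `c` passes the length threshold on its own, but it is dropped once its only remaining edge (to the short read `d`) disappears.
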

### [running]
- n_proc = 12; // Number of CPUs for the layout step

### [draft]
- min_cov = 10; // obsolete
- trim = 200; // obsolete
- edge_safe = 100; // obsolete
- tspace = 900; // Space between new "trace points"

### [consensus]
- min_length = 2000; // Minimal length of reads used for the final consensus
- trim_end = 200; // Trim ends of alignments for the final consensus
- best_n = 1; // If one read has multiple alignments with the backbone assembly, choose the longest n segments for consensus
- quality_threshold = 0.23; // Alignment quality threshold
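
For helper scripts, a configuration with these sections could be inspected from Python with the standard `configparser` module. The sketch below assumes a plain `key = value` layout without trailing comments; the actual layout of `utils/nominal.ini` may differ, and the obsolete `[draft]` keys are omitted.

```python
import configparser

# Assumed file content, using the nominal values listed above.
NOMINAL_INI = """
[filter]
length_threshold = 6500
quality_threshold = 0.23
n_iter = 2
aln_threshold = 2500
min_cov = 5
cut_off = 200
theta = 300

[running]
n_proc = 12

[draft]
tspace = 900

[consensus]
min_length = 2000
trim_end = 200
best_n = 1
quality_threshold = 0.23
"""

config = configparser.ConfigParser()
config.read_string(NOMINAL_INI)

# Typed accessors convert the raw strings to int or float.
length_threshold = config.getint("filter", "length_threshold")
quality_threshold = config.getfloat("consensus", "quality_threshold")
n_proc = config.getint("running", "n_proc")
```
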

# Installation

This software is still at the prototype stage, so it is not well packaged. However, it is designed to be modular so that different combinations of methods can be tested.

Installation takes a few commands:

```
git clone https://github.com/Eureka22/AwesomeAssembler.git
# enter the cloned repository before fetching submodules
cd AwesomeAssembler
git submodule init
git submodule update
./build.sh
```

# Running

To call the programs from anywhere, add the directory containing the binaries to your environment. The script `setup.sh` does this.

A demo run for assembling the E. coli genome looks like this:

```
source setup.sh
mkdir -p data/ecoli
cd data/ecoli
# reads.fasta should be in data/ecoli
fasta2DB ecoli reads.fasta
DBsplit -x500 -s100 ecoli
HPCdaligner -t5 ecoli | csh -v
# alternatively, you can put output of HPCdaligner to a bash file and edit it to support
rm ecoli.*.ecoli.*
LAmerge ecoli.las ecoli.*.las
rm ecoli.*.las # we only need ecoli.las
DASqv -c100 ecoli ecoli.las
# Run filter
Reads_filter --db ecoli --las ecoli.las -x ecoli --config /utils/nominal.ini
# Run layout
hinging --db ecoli --las ecoli.las -x ecoli --config /utils/nominal.ini -o ecoli
# Run postprocessing
python pruning_and_clipping.py ecoli.edges.hinges ecoli.hinge.list <identifier-of-run>
```
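
If the last three steps are run repeatedly, the command lines can be assembled by a small Python helper. This is only a sketch: the function name `hinge_commands` and the run identifier `demo` are hypothetical, and the flags are copied from the demo run above rather than from any program documentation.

```python
import shlex


def hinge_commands(prefix, config_path, run_id):
    """Build the filter, layout and postprocessing command lines.

    prefix:      database name, e.g. "ecoli" (the .las file is prefix + ".las").
    config_path: path to the ini configuration file.
    run_id:      identifier passed to the postprocessing script.
    """
    common = ["--db", prefix, "--las", f"{prefix}.las",
              "-x", prefix, "--config", config_path]
    return [
        ["Reads_filter", *common],
        ["hinging", *common, "-o", prefix],
        ["python", "pruning_and_clipping.py",
         f"{prefix}.edges.hinges", f"{prefix}.hinge.list", run_id],
    ]


# Print the commands for inspection; pass each list to subprocess.run to execute.
for cmd in hinge_commands("ecoli", "/utils/nominal.ini", "demo"):
    print(shlex.join(cmd))
```
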

## Debugging

### Showing ground truth on the graph
Some programs are for debugging and observation. For example, one can obtain the ground truth by mapping reads to the reference, which produces `ecoli.ecoli.ref.las`.

This `las` file can be parsed into a JSON file for other programs to use.

```
run_mapping.py ecoli ecoli.ref ecoli.ecoli.ref.las 1-$
```

In the pruning step, if `ecoli.mapping.json` exists, the output `graphml` file will contain the ground-truth information.

### Drawing alignment graphs and mapping graphs
Draw a read, for example 60947, and output the figure to the `sample` folder (add 1 to the read ID, since LAshow counts from 1):

```
draw2.py ecoli ecoli.las 60948 sample 100
```

Draw a pileup on the draft assembly, given a region (start, end):

```
draw2_pileup_region.py 3600000 4500000
```

# Results

For the E. coli 160X dataset, after shortening reads to have a mean length of 3500 (with a variance of 1500), the graph is preserved.

![image](ecoli_shortened.png)

The graph returned by Falcon for this dataset is:

![image](Falcon_ecoli_shortened.png)