Mon Sep 9 15:18:33 PDT 2019
============================
NSEG is being phased out at NCBI. Replaced NSEG with DUSTMASKER
in the filter-stage-1.prl script.
Tue Jun 3 12:52:42 PDT 2008
============================
Fixed problem with filter-stage-1.prl reported by Gyorgy Abrusan.
Fri Feb 8 15:13:12 PST 2008
============================
RepeatScout Changes: Robert Hubley, Institute for Systems Biology
  - Wrote new program to replace build_lmer_table ( called elmer ).
    This program can handle full genomes for lmers up to 16bp.
  - Simplified forward/reverse lmer handling by removing the need
    for 2 copies of the input sequence in memory. This change makes
    it easier to calculate and enforce sequence boundaries.
  - Modified RepeatScout to honor sequence boundaries:
      - Continues to load sequence(s) into single concatenated array.
      - Created new data structure to hold start positions of
        each fasta header.
      - Uses seq boundaries in extend_right/extend_left to make
        sequences non-active if they run into a boundary.
      - Uses seq boundaries in maskextend_right/maskextend_left to
        prohibit reduction of frequency of lmers which span boundaries.
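
The boundary bookkeeping described above can be illustrated with a minimal
Python sketch. This is not RepeatScout's C code: the function names
(concatenate, seq_index, spans_boundary) and the example records are
hypothetical, and the sketch only assumes what the notes state, namely one
concatenated array plus a list of start positions, one per fasta header.

```python
from bisect import bisect_right

def concatenate(records):
    """Join sequences into one string and record where each one starts."""
    starts, parts, offset = [], [], 0
    for _header, seq in records:
        starts.append(offset)
        parts.append(seq)
        offset += len(seq)
    return "".join(parts), starts

def seq_index(starts, pos):
    """Index of the sequence containing concatenated position pos."""
    return bisect_right(starts, pos) - 1

def spans_boundary(starts, total_len, pos, l):
    """True if the l-mer starting at pos crosses into another sequence
    or runs off either end of the concatenated array."""
    if pos < 0 or pos + l > total_len:
        return True
    return seq_index(starts, pos) != seq_index(starts, pos + l - 1)

# Hypothetical two-record input.
records = [("seq1", "ACGTAC"), ("seq2", "GGTTAA")]
genome, starts = concatenate(records)
```

An extension or masking routine could call spans_boundary before touching
an lmer and skip it when the result is true. The binary search keeps the
lookup cheap even with many fasta records.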
WISHLIST:
  - Reduce the memory footprint:
      - Sequence is in bytes ( 1 byte/base could be reduced to 3 bits/base )
      - build_all_pos() creates a linked list of all positions for all
        frequent lmers ( default > 3 ) in the input sequence at the start
        of the program. This could/should be the function of
        build_lmer_table() and for large genomes can be cached to disk.
        This would require all RS functions to be able to read from a disk
        based hash.
      - master sequences ( consensus models ) are kept over the course
        of the program. ( not necessary -- flush to disk ).
  - Support full custom matrices in addition to the Match/Mismatch based
    scoring system.
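
As a rough illustration of the 3 bits/base idea in the wishlist, the Python
sketch below packs bases into a byte buffer. The code assignment
(A, C, G, T, N mapped to 0 through 4) and the names pack and unpack are
assumptions made for this example only. The notes do not specify an
encoding, and RepeatScout itself is written in C.

```python
# Assumed 3-bit codes; the fifth symbol N is why 2 bits would not suffice.
CODES = {"A": 0, "C": 1, "G": 2, "T": 3, "N": 4}
BASES = "ACGTN"
BITS = 3

def pack(seq):
    """Pack a sequence into a bytearray using BITS bits per base."""
    buf = bytearray((len(seq) * BITS + 7) // 8)
    for i, base in enumerate(seq):
        code = CODES[base]
        for b in range(BITS):
            if (code >> b) & 1:
                bit = i * BITS + b
                buf[bit >> 3] |= 1 << (bit & 7)
    return buf

def unpack(buf, n):
    """Recover the first n bases from a packed buffer."""
    out = []
    for i in range(n):
        code = 0
        for b in range(BITS):
            bit = i * BITS + b
            if (buf[bit >> 3] >> (bit & 7)) & 1:
                code |= 1 << b
        out.append(BASES[code])
    return "".join(out)

seq = "ACGTNACG"
packed = pack(seq)
restored = unpack(packed, len(seq))
```

Eight bases fit in three bytes here instead of eight, which is the 8:3
reduction the wishlist item implies. The cost is extra bit arithmetic on
every base access.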