-
Notifications
You must be signed in to change notification settings - Fork 195
Haplotype Sampling
When a pangenome graph is used as a reference for read mapping, the variants that are included in the graph affect the results. Roughly speaking,
- variants that are present both in the graph and in the sequenced sample make mapping faster and more accurate; while
- variants that are present in the graph but not in the sample make mapping slower and less accurate.
The usual approach is building a graph with only common variants. Starting with vg version 1.49.0, there is another option: building a personalized reference for each sample. This is achieved by counting kmers in the reads and generating a small number of synthetic haplotypes based on the kmer counts.
vg index -j graph.dist graph.gbz
vg gbwt -p --num-threads 16 -r graph.ri -Z graph.gbz
vg haplotypes -v 2 -t 16 -H graph.hapl graph.gbz
We assume that the graph is based on multiple sequence alignment and has a fairly linear high-level structure. Such graphs can be built using the Minigraph-Cactus pipeline. The graph and the haplotypes must be in GBZ format.
The preprocessing step is done once per graph. It transforms each chromosome into a sequence of subchains (blocks) of target length 10 kbp. The idea is that we can choose the most relevant haplotypes independently in each subchain and concatenate them to form synthetic haplotypes. For each subchain, se select some kmers that are specific to the subchain and represent each haplotype by their presence / absence. At the moment, the kmers are minimizers with a single hit in the graph.
Preprocessing needs three indexes in addition to the GBZ graph: a distance index, an r-index, and a minimizer index.
The distance index can be built using the vg index
subcommand.
The r-index can be built using the vg gbwt
subcommand. Option -p
/ --progress
prints progress information to stderr, while option --num-threads
specifies the number of threads used in index construction.
We build a new minimizer index instead of loading a prebuilt one from disk. Because we do not need the distance index annotations used by vg giraffe
, we can save memory by building a smaller index without them.
We generate the haplotype information using the vg haplotypes
subcommand. There are two required arguments: the input graph that will be preprocessed, and the output file for the haplotype information, specified with option -H
/ --haplotype-output
.
The following options can be used:
-
-d
/--distance-index
: Distance index file. Default:graph.dist
if the input graph isgraph.gbz
. -
-r
/--r-index
: R-index file. Default:graph.ri
if the input graph isgraph.gbz
. -
--kmer-length
: Kmer length used to be used in the haplotype information (up to 31). Default: 29. -
--window-length
: Window length used for building the minimizer index. Default: 11. -
--subchain-length
: Target length (in bp) for subchains. Default: 10000. -
-v
/--verbosity
: Verbosity level (0 = silent, 1 = basic, 2 = detailed). Default: 0. -
-t
/--threads
: Approximate number of threads to be used. Default: depends on the system.
export TMPDIR=/scratch/tmp
kmc -k29 -m128 -okff -t16 -hp reads.fq.gz ${TMPDIR}/reads $TMPDIR
vg haplotypes -v 2 -t 16 --include-reference \
-i graph.hapl -k ${TMPDIR}/reads.kff -g ${TMPDIR}/sampled.gbz graph.gbz
vg giraffe -p -t 16 -Z ${TMPDIR}/sampled.gbz -i -f reads.fq.gz > reads.gam
Haplotype sampling is based on kmer counts in the reads. Using the counts, the algorithm classifies each kmer used in the haplotype information as absent, heterozygous, present, or frequent in the sequenced sample. It then selects a small number of most relevant haplotypes in each subchain and concatenates them to form synthetic haplotypes.
We assume that the kmer counts are in the KFF format. Any kmer counter supporting the format should work, though we have used only KMC so far. Kmers with a single occurrence in the reads can be safely ignored, and the counts for a kmer and its reverse complement can be combined or separate.
It is important to use the same kmer length in kmer counting as in the haplotype information (default: 29).
TODO: vg haplotypes
command
TODO: Scoring model
The following options can be used:
-
--coverage
: Kmer coverage in the reads. Default: estimate from kmer counts. -
--num-haplotypes
: Number of generated synthetic haplotypes. Default: 4. -
--present-discount
: Multiplicative factor for discounting scores for present kmers. Default: 0.9. -
--het-adjustment
: Additive term for adjusting scores for heterozygous kmers. Default: 0.05. -
--absent-score
: Score for absent kmers. Default: 0.8. -
--include-reference
: Include reference paths and generic paths from the full graph in the sampled graph. -
-v
/--verbosity
: Verbosity level (0 = silent, 1 = basic, 2 = detailed). Default: 0. -
-t
/--threads
: Approximate number of threads to be used. Default: depends on the system.
Option --include-reference
is necessary if the sampled graph is intended for a workflow based on a linear reference genome. That includes mapping reads in the BAM format and calling variants relative to a linear reference genome.
TODO: Let Giraffe build them or use the following. See Giraffe best practices.
vg index -j sampled.dist sampled.gbz
vg minimizer -p -t 16 -o sampled.min -d sampled.dist sampled.gbz
TODO: Alignments to sampled graph are alignments to full graph.