Haplotype Sampling

Overview

When a pangenome graph is used as a reference for read mapping, the variants that are included in the graph affect the results. Roughly speaking, variants that are present both in the graph and in the sequenced sample make mapping faster and more accurate, while variants that are present in the graph but not in the sample make mapping slower and less accurate.

The usual approach is building a graph with only common variants. Starting with vg version 1.49.0, there is another option: building a personalized reference for each sample. This is achieved by counting kmers in the reads and generating a small number of synthetic haplotypes based on the kmer counts.
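
To make "counting kmers" concrete, here is a toy sketch that counts the kmers of a single short sequence with awk. It is only an illustration: the function name `count_kmers` and the example sequence are made up for this sketch, and it does not reflect how a dedicated kmer counter works internally or how it handles reverse complements.

```shell
# Toy illustration only (hypothetical helper, not part of vg or KMC):
# count every substring of length k in one sequence.
count_kmers() {
    # $1 = sequence, $2 = k
    printf '%s\n' "$1" | awk -v k="$2" '{
        for (i = 1; i + k - 1 <= length($0); i++)
            count[substr($0, i, k)]++
    } END {
        for (kmer in count) print kmer, count[kmer]
    }' | LC_ALL=C sort
}

# Prints one "kmer count" pair per line, sorted by kmer.
count_kmers GATTACAGATTA 4
```

In this example GATT and ATTA each occur twice and every other 4-mer occurs once. Kmers with unexpectedly low or high counts in real reads are the kind of signal that haplotype sampling works from.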

Preprocessing the graph

# Build the distance index.
vg index -j graph.dist graph.gbz
# Build the r-index.
vg gbwt -p --num-threads 16 -r graph.ri -Z graph.gbz
# Generate the haplotype information.
vg haplotypes -v 2 -t 16 -H graph.hapl graph.gbz
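
Before moving on to sampling, it may be worth confirming that preprocessing actually produced its outputs. The following is a minimal sketch, assuming the file names used in the commands above; `require_files` is a hypothetical helper written for this example, not something vg provides.

```shell
# Hypothetical helper (not part of vg): report files that are missing or empty.
# Returns 0 if every file exists and is non-empty, 1 otherwise.
require_files() {
    missing=0
    for f in "$@"; do
        if [ ! -s "$f" ]; then
            echo "missing or empty: $f" >&2
            missing=1
        fi
    done
    return "$missing"
}

# File names follow the preprocessing commands above.
if require_files graph.gbz graph.dist graph.ri graph.hapl; then
    echo "preprocessing outputs found"
else
    echo "preprocessing incomplete" >&2
fi
```

The check only tests that each file exists and is non-empty. It does not validate the contents of any index.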

...

Required indexes

Generating haplotype information

Haplotype sampling

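# Working directory for KMC temporary files.
export TMPDIR=/scratch/tmp
# Count 29-mers in the reads and write them in KFF format (reads.kff).
kmc -k29 -m128 -okff -t16 -hp reads.fq.gz reads $TMPDIR
# Sample haplotypes based on the kmer counts, including the reference paths.
vg haplotypes -v 2 -t 16 --include-reference \
    -i graph.hapl -k reads.kff -g sampled.gbz graph.gbz
# Build the distance and minimizer indexes for the sampled graph.
vg index -j sampled.dist sampled.gbz
vg minimizer -p -t 16 -o sampled.min -d sampled.dist sampled.gbz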
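
Because these steps are repeated for every sample, they can be wrapped in a small shell function. This is a sketch under stated assumptions, not an official workflow: the function name `sample_haplotypes`, its argument order, and the choice to name all outputs after the sample are assumptions made here. The kmc and vg options are the same ones used in the commands above, and `graph.hapl` and `graph.gbz` are expected in the current directory.

```shell
# Hypothetical wrapper: the name, arguments, and output naming are assumptions.
# Usage: sample_haplotypes READS.fq.gz SAMPLE [THREADS]
sample_haplotypes() {
    reads=$1
    sample=$2
    threads=${3:-16}
    tmpdir=${TMPDIR:-/tmp}

    # Each step runs only if the previous one succeeded.
    kmc -k29 -m128 -okff "-t$threads" -hp "$reads" "$sample" "$tmpdir" &&
    vg haplotypes -v 2 -t "$threads" --include-reference \
        -i graph.hapl -k "$sample.kff" -g "$sample.gbz" graph.gbz &&
    vg index -j "$sample.dist" "$sample.gbz" &&
    vg minimizer -p -t "$threads" -o "$sample.min" -d "$sample.dist" "$sample.gbz"
}
```

For example, `sample_haplotypes reads.fq.gz sampled` would write `sampled.kff`, `sampled.gbz`, `sampled.dist`, and `sampled.min`. Chaining the steps with `&&` stops the pipeline at the first failing command, so later steps never run on incomplete inputs.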

...