Haplotype Sampling
Jouni Siren edited this page Jun 4, 2023 · 28 revisions
When a pangenome graph is used as a reference for read mapping, the variants that are included in the graph affect the results. Roughly speaking,
- variants that are present both in the graph and in the sequenced sample make mapping faster and more accurate; while
- variants that are present in the graph but not in the sample make mapping slower and less accurate.
The usual approach is building a graph with only common variants. Starting with vg version 1.49.0, there is another option: building a personalized reference for each sample. This is achieved by counting kmers in the reads and generating a small number of synthetic haplotypes based on the kmer counts.
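To make "counting kmers in the reads" concrete, here is a toy sketch in plain shell and awk. It only illustrates what a kmer count table is. It does not reflect how KMC or `vg haplotypes` work internally, and the function name `count_kmers`, the tiny value of k, and the example sequences are all made up for illustration.

```shell
#!/usr/bin/env bash
# Toy illustration only (hypothetical helper, not part of vg or KMC):
# count every length-k substring of the sequences read from stdin,
# one sequence per line, and print "kmer<TAB>count" sorted by kmer.
count_kmers() {
    awk -v k="$1" '
        {
            # Slide a window of length k over the sequence.
            for (i = 1; i + k - 1 <= length($0); i++) {
                counts[substr($0, i, k)]++
            }
        }
        END {
            for (kmer in counts) print kmer "\t" counts[kmer]
        }
    ' | LC_ALL=C sort
}

# Two short made-up "reads" that overlap in ATTAC and TTACA.
printf 'GATTACA\nATTACAG\n' | count_kmers 5
# Prints:
# ATTAC   2
# GATTA   1
# TACAG   1
# TTACA   2
```

Kmers that occur in several reads get higher counts. The real workflow uses counts like these, computed by a dedicated kmer counter, to decide which haplotypes to sample.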
vg index -j graph.dist graph.gbz
vg gbwt -p --num-threads 16 -r graph.ri -Z graph.gbz
vg haplotypes -v 2 -t 16 -H graph.hapl graph.gbz
...
export TMPDIR=/scratch/tmp
kmc -k29 -m128 -okff -t16 -hp reads.fq.gz reads $TMPDIR
vg haplotypes -v 2 -t 16 --include-reference \
-i graph.hapl -k reads.kff -g sampled.gbz graph.gbz
vg index -j sampled.dist sampled.gbz
vg minimizer -p -t 16 -o sampled.min -d sampled.dist sampled.gbz
...
(or just using Giraffe with the GBZ)
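As a sketch of the final step, mapping reads to the sampled graph with Giraffe might look like the following. The wrapper function `map_with_sampled_graph`, the `DRY_RUN` switch, and the output name `mapped.gam` are assumptions made for illustration. The index file names are the ones produced by the commands above. The option names are the ones I understand `vg giraffe` to accept in versions around 1.49, so check `vg giraffe --help` for your version.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: map reads to the sampled graph with Giraffe.
# With DRY_RUN=1 it only prints the command instead of running vg.
map_with_sampled_graph() {
    local reads="$1" out="$2"
    # -Z: GBZ graph, -d: distance index, -m: minimizer index,
    # -f: FASTQ input, -p: show progress, -t: number of threads.
    set -- vg giraffe -p -t 16 \
        -Z sampled.gbz -d sampled.dist -m sampled.min \
        -f "$reads"
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$* > $out"
    else
        # Giraffe writes alignments to standard output (GAM by default).
        "$@" > "$out"
    fi
}

# Show the command that would be run, without needing vg installed.
DRY_RUN=1 map_with_sampled_graph reads.fq.gz mapped.gam
```

Without `DRY_RUN=1`, the function runs the command and writes the alignments to the given output file.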