
Haplotype Sampling

Jouni Siren edited this page Jun 4, 2023 · 28 revisions

Overview

When a pangenome graph is used as a reference for read mapping, the variants that are included in the graph affect the results. Roughly speaking,

  • variants that are present both in the graph and in the sequenced sample make mapping faster and more accurate; while
  • variants that are present in the graph but not in the sample make mapping slower and less accurate.

The usual approach is to build the graph with only common variants. Starting with vg version 1.49.0, there is another option: building a personalized reference for each sample. This is done by counting kmers in the reads and using the counts to generate a small number of synthetic haplotypes.
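
To make the idea of kmer counting concrete, here is a toy sketch in plain shell and awk. It is only an illustration under simple assumptions: the real workflow counts kmers with KMC (k = 29) over all reads, and the function name `count_kmers` and the example sequence are made up for this sketch.

```shell
# Toy kmer counter: prints each distinct kmer of a sequence with its count.
# count_kmers is a hypothetical helper, not part of vg or KMC.
count_kmers() {
    # $1 = sequence, $2 = kmer length
    printf '%s\n' "$1" | awk -v k="$2" '{
        # Slide a window of length k over the sequence.
        for (i = 1; i + k - 1 <= length($0); i++) n[substr($0, i, k)]++
    } END {
        for (s in n) print s "\t" n[s]
    }' | sort
}

# Example: the sequence below has 7 distinct 4-mers;
# GATT occurs twice, ACAG occurs once.
count_kmers GATTACAGATTACA 4
```

Kmers that occur in the reads with the expected frequency are evidence for the haplotypes that contain them, which is what the sampling step uses.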

Preprocessing the graph

vg index -j graph.dist graph.gbz
vg gbwt -p --num-threads 16 -r graph.ri -Z graph.gbz
vg haplotypes -v 2 -t 16 -H graph.hapl graph.gbz

...

Required indexes

Generating haplotype information

Haplotype sampling

export TMPDIR=/scratch/tmp
kmc -k29 -m128 -okff -t16 -hp reads.fq.gz reads $TMPDIR
vg haplotypes -v 2 -t 16 --include-reference \
    -i graph.hapl -k reads.kff -g sampled.gbz graph.gbz
vg index -j sampled.dist sampled.gbz
vg minimizer -p -t 16 -o sampled.min -d sampled.dist sampled.gbz
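
Because a personalized reference is built separately for each sample, it may be convenient to wrap the commands above in a function. The following is a minimal sketch, not an official script: the names `run` and `sample_haplotypes`, the `DRY_RUN` switch, the fallback to `/tmp`, and the convention of naming the outputs after the sample are all assumptions made for illustration. The command lines themselves are the ones shown above.

```shell
# Hypothetical helper: print the command when DRY_RUN=1, otherwise run it.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Hypothetical wrapper around the four sampling commands.
# $1 = reads (FASTQ), $2 = sample prefix for outputs,
# $3 = graph prefix (default: graph), $4 = threads (default: 16)
sample_haplotypes() {
    reads=$1
    sample=$2
    graph=${3:-graph}
    threads=${4:-16}
    tmp=${TMPDIR:-/tmp}   # assumption: fall back to /tmp if TMPDIR is unset

    # Each step runs only if the previous one succeeded.
    run kmc -k29 -m128 -okff "-t$threads" -hp "$reads" "$sample" "$tmp" &&
    run vg haplotypes -v 2 -t "$threads" --include-reference \
        -i "$graph.hapl" -k "$sample.kff" -g "$sample.gbz" "$graph.gbz" &&
    run vg index -j "$sample.dist" "$sample.gbz" &&
    run vg minimizer -p -t "$threads" -o "$sample.min" -d "$sample.dist" "$sample.gbz"
}
```

With `DRY_RUN=1` set, calling `sample_haplotypes reads.fq.gz sampled` only prints the four command lines, which is a cheap way to check file names before starting a long run.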

...

Kmer counting

Sampling the haplotypes

Index construction

(or just using Giraffe with the GBZ)
