
Roadmap

Adam Novak edited this page Feb 8, 2021 · 71 revisions

This document sets out the high-level tasks that the vg development team hopes to accomplish in the next few versions of vg and beyond.

By Time

These are the things we hope to achieve on several planning horizons:

Next 3 Months

  • Easy Giraffe #3126
    • Unified Indexing #3144
      • vg index --giraffe #3144 (Adam, Jordan)
      • vg index --mpmap (Adam, Jordan)
      • vg index --map (Adam, Jordan)
      • Less cruft in vg index #3144 (Adam, Jordan)
    • Document this in a way that is under test #3145
    • Low-memory construction of indexes (100 genomes, 100m variants) for Giraffe from extended GFA text (with translation saving) (Jouni)
      • Get working in 200 GB memory (scales with graph node count) for easy/current GBWT build implementation
        • Split by contig and merge?
        • Smart job scheduling to keep in a RAM budget
  • GFA compatibility improvement push
    • Settle how to handle long nodes
      • Chopping overlay (Jordan)
    • Refer people to GetBlunted for non-blunt GFA import.
    • Accept non-numeric ids in GFA import (bonus points for preserving minigraph s<ID> as <ID> internally). (Jordan)
      • Use original s names in GAF
    • Check if text GFA is feasible at HPRC scale (Glenn)
    • Paths and haplotypes from GFA (Adam, Glenn)
      • Convert rGFA tags (SN SO SR...) into paths, for at least rank-0 (the primary reference)
      • Accept GFA-style paths that contain sample and haplotype ids instead of names for haplotype import.
      • Accept extended GFA haplotypes
        • Basic subpath support in vg
      • Invent paths for a GFA that lacks them (Melissa)
  • Translations (Jouni to start)
    • Define and emit translation from chopped graph back to input GFA coordinates, for manual import
    • Saving, loading, and using coordinate translations to/from node-coalesced, string-ID'd input GFA space
      • HG interface and implementation in libbdsg graphs?
    • Back-translate mappings
    • Implicit node chopping on GFA input
    • Keep an eye on rGFA
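
To make the chopping and translation items above concrete, here is a minimal sketch of what a translation from a chopped graph back to input GFA coordinates could look like. It is an illustration only, not vg's implementation: the function names `chop_segments` and `back_translate`, the in-memory dictionary layout, and the restriction to forward-strand offsets are all assumptions made for this example.

```python
from typing import Dict, Tuple


def chop_segments(
    segments: Dict[str, str], max_len: int
) -> Tuple[Dict[int, str], Dict[int, Tuple[str, int]]]:
    """Chop named GFA segments into numeric nodes of at most max_len bases.

    Hypothetical helper: returns (nodes, translation), where nodes maps a new
    integer node ID to its sequence, and translation maps that node ID to
    (original segment name, 0-based offset of the node within the segment).
    Segments with an empty sequence produce no nodes in this sketch.
    """
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    nodes: Dict[int, str] = {}
    translation: Dict[int, Tuple[str, int]] = {}
    next_id = 1
    for name, seq in segments.items():
        # Walk the segment in max_len-sized windows, one new node per window.
        for offset in range(0, len(seq), max_len):
            nodes[next_id] = seq[offset:offset + max_len]
            translation[next_id] = (name, offset)
            next_id += 1
    return nodes, translation


def back_translate(
    translation: Dict[int, Tuple[str, int]], node_id: int, node_offset: int
) -> Tuple[str, int]:
    """Map a forward-strand position on a chopped node to input coordinates."""
    name, start = translation[node_id]
    return name, start + node_offset


# Non-numeric segment names (as in minigraph's s<ID>) survive in the translation.
nodes, translation = chop_segments({"s1": "ACGTACGTAC", "s2": "GG"}, max_len=4)
position = back_translate(translation, node_id=3, node_offset=1)
```

Keeping the original string name in the translation rather than in the node ID is one way to let the graph use numeric IDs internally while still reporting the original names in output such as GAF.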

Research topics: how to achieve accurate long-read Giraffe scaling at Q20 is still an open question.

Next 6 Months

  • Full subpath support in vg (Adam, Jordan)
    • HG API support (see old GitHub issue on handlegraph)
    • Plugging in to tools
  • Implement Distance Index 2, which also works as a snarl manager (Xian)
  • Drop pinchesAndCacti and sonlib
    • Drop Cactus-library-based snarl finder (Adam)
  • Adapt all snarl usage to go through new handle-based API (Adam, Jordan)
    • Shim Snarl* as a non-Protobuf adapter type?
  • Transparently load GFA into HashGraph for any tool that reads a handle graph. (Glenn, Adam)
    • Probably better than a mapped GFA file backed graph
  • Eliminate vg::VG (Jordan)
    • Steal all the things only it can do away from it
  • Default everything to GAF instead of GAM
    • mpGAF (Jordan, Jonas)
  • Long read Giraffe (Xian)

Next Year

  • Instant load/memory mapping
    • For tube map, to enable interactive whole-genome use (Future data vis enthusiast)
    • For Giraffe
    • For graph access from Python via libbdsg
  • Algorithms in libbdsg, available from Python
  • Get GBWT build working in under 200 GB memory on 100m variants with fancy disk-backed in-progress GBWT implementation (need 300m random access vectors that grow independently)
  • Support Erik's multi-level graph format when mature
  • Redesign and reorganize little tools (Where should each manipulation live? Should some just be scripts you write?)
    • vg mod
    • vg chunk
    • vg circularize
    • vg view
    • vg paths

Running Projects

These are things we are working on, with no particular delivery date goal.

  • Use of MCMC techniques in the genotyper with multipath alignments

Wishlist

These are things we would like to do eventually.

  • Alignment
    • Adoption of the multipath alignment paradigm as the default
    • Graph-to-graph mapping (Xian)
  • Variant Calling
    • Implementation of an HHGA-like machine learning based variant caller
    • Integration of variant calling and assembly polishing processes
    • Prune the zoo of TraversalFinders, and expose the useful ones to Python
  • Visualization
    • Browser-free tube map
    • Better tube map handling of edge cases
      • No haplotypes on a node
      • Starting on a rare haplotype
  • Infrastructure
    • Destructively modernize and unify IO
      • Eliminate VPKG framing if possible in favor of magic numbers everywhere
        • Resolve ensuing questions about GAM format
          • Just use GAF?
        • Handle things like GFA that need to manually sniff
      • Just save from the object; no more save_handle_graph
      • Magic format registration for libvgio magic numbers for loading
      • Depend on libvgio in libbdsg to do the IO there and pick the right handle graph implementation
    • Replace Protobuf internal formats with faster ones
    • Revision of ID assignment logic to allow deterministic node breaking
    • Accept gzipped GFA if practical (can't mmap)
    • Improved HandleGraph API
      • Abstract away node boundaries
      • View all sequence as C++17 string_views instead of sequence-owning strings
      • O(1) reverse complement DNAStringView
    • CMake-ify the main vg build
    • Eliminate old systems and their associated submodules, or factor them out into their own projects
      • vg vectorize could be its own project
        • Update vg vectorize to modern, system Vowpal Wabbit
        • Or pull it out into its own submodule and remove Vowpal Wabbit dependency from vg
      • Eliminate RocksDB from vg; everybody using vg map uses GCSA indexes now.
      • vg genotype
      • vg srpe
    • More cross-language support
      • Interoperate with Rust handle graph users/providers
      • Interoperate with Java handle graph users/providers
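
The "O(1) reverse complement DNAStringView" item under the Improved HandleGraph API can be illustrated with a small sketch. The wishlist targets C++17 string_views; the Python below only demonstrates the idea of a non-owning view whose reverse complement is a flag flip rather than a copy. The class name `DNAView`, its methods, and the choice to map unrecognized characters to `N` are assumptions for this example, not an existing vg or libbdsg API.

```python
# Complement table; any other character is reported as "N" (an assumption).
_COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}


class DNAView:
    """Non-owning view of part of a DNA string, with a lazy orientation flag."""

    def __init__(self, seq: str, start: int = 0, length: int = None,
                 reverse: bool = False):
        if length is None:
            length = len(seq) - start
        if start < 0 or length < 0 or start + length > len(seq):
            raise ValueError("view out of range")
        self._seq = seq          # shared reference, never copied
        self._start = start
        self._length = length
        self._reverse = reverse

    def reverse_complement(self) -> "DNAView":
        # O(1): same backing string and bounds, opposite orientation.
        return DNAView(self._seq, self._start, self._length, not self._reverse)

    def __len__(self) -> int:
        return self._length

    def __getitem__(self, i: int) -> str:
        if i < 0:
            i += self._length
        if not 0 <= i < self._length:
            raise IndexError("index out of range")
        if self._reverse:
            # Read from the far end and complement on the fly.
            base = self._seq[self._start + self._length - 1 - i]
            return _COMPLEMENT.get(base, "N")
        return self._seq[self._start + i]

    def __str__(self) -> str:
        # Materializing the sequence is O(n); only creating the view is O(1).
        return "".join(self[i] for i in range(self._length))


view = DNAView("AACG")
rc = view.reverse_complement()
```

The cost of complementing is paid per base at access time, so consumers that only scan part of a node never pay for the rest.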