
Transcriptomic analyses

Jonas Andreas Sibbesen edited this page Oct 20, 2020 · 15 revisions

This wiki describes how to use vg rna and related tools for transcriptomic analyses.

Spliced variation graphs

Similar to how genomic variant information can be represented using a variation graph, the splicing structure of a gene can also be represented as a graph.

Here, nodes and edges correspond to exons and splice-junctions, respectively, and transcripts are represented as paths through the graph. Graphs without the introns and intergenic regions are also known as splice graphs.

This spliced reference graph can be combined with a variation graph to produce a spliced variation graph containing both the transcriptomic splicing information and genomic variant information.

Paths through this graph can still represent haplotypes and transcripts, but can now also represent haplotype-specific transcripts (not shown above).

Construction

We can use vg rna to construct these spliced variation graphs by adding splice-junctions and optionally transcripts to an existing graph. This can be done using the following command:

vg rna -p -t <threads> -n annotation.[gtf|gff3] graph.pg > spliced_graph.pg

The following additional options are available.

Transcript annotation: vg rna supports both the gtf and gff3 transcript annotation formats. Note that all references (column 1) in the annotation must be part of the graph as embedded paths. By default, only lines with the exon feature (column 3) will be parsed. This can be changed using --feature-type. In addition, the attribute tag (column 9) that is used as the transcript id/name can be changed using --transcript-tag (default: transcript_id).
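Before running vg rna it can be useful to check what an annotation actually contains. The following is a minimal sketch using only standard shell tools on a made-up two-line annotation (toy.gtf and its contents are hypothetical). It lists the reference names used by exon lines, counts the lines that would be parsed with the default feature type, and extracts the transcript ids from the default transcript_id tag.

```shell
# Write a two-line toy annotation (tab-separated, 9 columns); all values are made up.
printf 'chr1\ttoy\texon\t100\t200\t.\t+\t.\tgene_id "g1"; transcript_id "t1";\n' > toy.gtf
printf 'chr1\ttoy\tCDS\t120\t180\t.\t+\t0\tgene_id "g1"; transcript_id "t1";\n' >> toy.gtf

# Reference names (column 1) used by exon lines (column 3).
# Each of these must exist in the graph as an embedded path.
refs=$(awk -F '\t' '$3 == "exon" { print $1 }' toy.gtf | sort -u)

# Number of lines carrying the default feature type (exon).
exons=$(awk -F '\t' '$3 == "exon"' toy.gtf | wc -l | tr -d ' ')

# Transcript ids read from the transcript_id tag in column 9.
ids=$(awk -F '\t' '$3 == "exon"' toy.gtf | sed -n 's/.*transcript_id "\([^"]*\)".*/\1/p' | sort -u)

echo "references: $refs; exon lines: $exons; transcripts: $ids"
```

The reference names reported here can then be compared against the names of the embedded paths in the graph.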

Intron database: Besides transcripts, a database of introns can also be added as splice-junctions to a graph. This can be done using the option --introns. The input format is BED with the start and end being the intron boundaries. Note that the strand (column 6) is also needed.
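As an illustration of the expected input, the sketch below writes a small made-up intron file (toy_introns.bed is a hypothetical name) and uses awk to flag lines that lack a strand in column 6. The additional check that the start is smaller than the end is a general BED sanity check assumed here, not a documented vg rna requirement.

```shell
# Toy intron database (BED, tab-separated); coordinates and names are made up.
printf 'chr1\t200\t300\tintron1\t0\t+\n' > toy_introns.bed
printf 'chr1\t400\t350\tintron2\t0\t-\n' >> toy_introns.bed   # start >= end
printf 'chr1\t500\t600\n' >> toy_introns.bed                    # strand column missing

# Count lines with a strand in column 6 and start < end.
valid=$(awk -F '\t' 'NF >= 6 && ($6 == "+" || $6 == "-") && $2 + 0 < $3 + 0' toy_introns.bed | wc -l | tr -d ' ')

# Collect the line numbers of all remaining (problematic) lines.
invalid=$(awk -F '\t' '!(NF >= 6 && ($6 == "+" || $6 == "-") && $2 + 0 < $3 + 0) { bad = bad (bad ? " " : "") NR } END { print bad }' toy_introns.bed)

echo "valid: $valid; problematic lines: $invalid"
```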

Graph format: vg rna supports any of the handle graph implementations and will use the same format for the graph output as the input. It is, however, recommended that the PackedGraph format is used as it strikes a good balance between memory usage and graph edit speed. A graph can be converted to the PackedGraph format using vg convert -p.

Transcript paths: Reference transcript paths can be added as embedded paths to the graph using --add-ref-paths. Reference transcript paths are transcripts that follow the reference paths defined in the annotation (column 1). See the Haplotype-specific transcripts section for more information on projected non-reference transcript paths.

Splice graph: By default, vg rna will construct a spliced variation graph that includes the intergenic and intronic regions. If only the exonic regions (splice graph) are of interest, this can be changed using --remove-non-gene. Note that all existing embedded paths will then be deleted (including the reference). It is therefore recommended that transcript paths are added to the graph (see above).
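To show how the options above might be combined, the following sketch assembles and prints three command lines without running them. The file names (graph.vg, graph.pg, annotation.gtf, introns.bed and the output names) and the thread count are placeholders, and only flags mentioned on this page are used. Whether a given combination suits your data is something to check against vg rna --help for your version.

```shell
threads=4                      # placeholder thread count
annotation="annotation.gtf"    # placeholder annotation file
introns="introns.bed"          # placeholder intron database

# Convert an existing graph to the recommended PackedGraph format.
convert_cmd="vg convert -p graph.vg > graph.pg"

# Spliced variation graph with annotated transcripts, extra introns and
# reference transcript paths added as embedded paths.
spliced_cmd="vg rna -p -t $threads -n $annotation --introns $introns --add-ref-paths graph.pg > spliced_graph.pg"

# Exon-only splice graph. --add-ref-paths is included because
# --remove-non-gene deletes all existing embedded paths.
splice_cmd="vg rna -p -t $threads -n $annotation --add-ref-paths --remove-non-gene graph.pg > splice_graph.pg"

# Print the commands for inspection (dry run only).
printf '%s\n' "$convert_cmd" "$spliced_cmd" "$splice_cmd"
```

Printing the commands first makes it easy to review the flag combination before starting a long-running construction.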

Haplotype-specific transcripts

More to come soon.

Downstream analyses

All of the standard tools in the vg toolkit also work on spliced variation graphs. However, some tools have been optimized or designed specifically for transcriptomic analyses.

RNA-seq mapping

To map RNA-seq reads to a spliced variation graph, we recommend using vg mpmap, as it has a mode (-n rna) that has been optimized specifically for RNA-seq data. More information on how to run it with RNA-seq data can be found on the Multipath alignments and vg mpmap wiki page.

Transcript quantification

rpvg can be used to infer the expression of (haplotype-specific) transcript paths. Although it is not part of the vg toolkit, it works directly on the output from vg rna and vg mpmap. rpvg takes as input a spliced variation graph, read alignments in either gam or gamp format, and a set of transcript paths represented in a GBWT (see the Haplotype-specific transcripts section on how to construct this using vg rna). rpvg is able to work on large sets of transcript paths and has successfully been used to infer the expression of 12M haplotype-specific transcripts (constructed from all 5,008 haplotypes in the 1000 Genomes Project).

Read alignments: In our experience, the best results are achieved using the default multipath alignment output (gamp) from mpmap as input to rpvg. In addition, to get more accurate probability estimates in rpvg, it is recommended that the --remove-bonuses option is used in mpmap.
