evodevo_tutorial.Rmd

---
title: "MEFISTO application to evodevo data"
author:
- name: "Britta Velten"
  affiliation: "German Cancer Research Center, Heidelberg, Germany"
  email: "b.velten@dkfz-heidelberg.de"
date: "`r Sys.Date()`"
output:
  BiocStyle::html_document:
    toc_float: true
vignette: >
  %\VignetteIndexEntry{MEFISTO on evodevo data}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---


```{r setup, message=FALSE, warning=FALSE}
library(MOFA2)
library(magrittr)
library(tidyverse)
library(cowplot)
library(reshape2)
```


# Load data
The data for this vignette can be downloaded from [here](https://figshare.com/s/242916198fde3353f3e6) and placed in a `data/` directory. You should then have two files:

- `data/evodevo.RData` containing the expression data used as input
- `data/evodevo_model.hdf5` containing the pre-trained MEFISTO model

We first load the dataframe that contains the gene expression data for all organs (views) and species (groups) as well as the (unmatched) time annotation for each sample.
```{r load_data}
load("data/evodevo.RData")
head(evodevo)
```


# Prepare and train MOFA
*Create the MOFA object*

First, we need to create a MOFA object from this data. This step is analogous to use of MOFA without the time information and can be done using `create_mofa`, which results in an untrained MOFA object.
```{r, message=FALSE, warning=FALSE}
# create the MOFA object and add times as covariate
MOFAobject_untrained <- create_mofa(data = evodevo)
MOFAobject_untrained
```
Next, we want to add the time information for each sample, which we can do using `set_covariates`. As the time is already contained in the data.frame that we passed to the MOFAobject in `create_mofa`, we can just specify to use this column as covariate. Alternatively, we could also supply a new matrix or data frame providing the sample names and covariates. See `?set_covariates`. 
```{r}
head(samples_metadata(MOFAobject_untrained))
MOFAobject_untrained <- set_covariates(MOFAobject_untrained, covariates = "time")
MOFAobject_untrained
```

**Show data and time covariate**
Before moving towards the training, let's first take a look at the data in the untrained object. The plot shows for species (groups) which organ (view) are available for which samples. The right plots show the time values for each sample.
```{r data_overview, fig.height= 10}
gg_input <- plot_data_overview(MOFAobject_untrained,
                               show_covariate = TRUE,
                               show_dimensions = TRUE) 
gg_input
```

**Prepare the MOFA object**
Next, we prepare the model for training. For this we specify the different  options. Similar to the standard MOFA workflow we define data, model and training options. Take a look at `get_default_data_options`, `get_default_model_options` and `get_default_training_options` to see what options can be specified. Here, we specify that we want to use 5 factors and set a seed for reproducibility. 

In addition, for MEFISTO we can specify MEFISTO-specific options, see `get_default_mefisto_options`. Here, we specify that we want to align the times across species using Mouse as a reference group. Furthermore, we can optionally specify time points (when using warping these correspond to times in the reference group), for which to interpolate the factors in all groups.
```{r MOFA_options}
data_opts <- get_default_data_options(MOFAobject_untrained)

model_opts <- get_default_model_options(MOFAobject_untrained)
model_opts$num_factors <- 5

train_opts <- get_default_training_options(MOFAobject_untrained)
train_opts$seed <- 2020
train_opts$convergence_mode <- "medium" # use "fast" for faster training

mefisto_opts <- get_default_mefisto_options(MOFAobject_untrained)
mefisto_opts$warping <- TRUE
mefisto_opts$warping_ref <- "Mouse"
mefisto_opts$new_values <- 1:14 
```

We then pass all these options to the object.
```{r prepare_MOFA, message = FALSE}
MOFAobject_untrained <- prepare_mofa(
  object = MOFAobject_untrained,
  data_options = data_opts,
  model_options = model_opts,
  training_options = train_opts,
  mefisto_options = mefisto_opts
) 
```

**Train the MOFA model**
Now, we are ready to train the model.
```{r}
outfile <- "data/evodevo_model.hdf5"
# MOFAobject <- run_mofa(MOFAobject_untrained, outfile = outfile)
```

As this might take ~15 min, for all further analysis we will use a pre-trained model. In addition to the factor values at the observed data points we can optionally load the interpolated data.
```{r load_model, warning = FALSE}
MOFAobject <- load_model(outfile, load_interpol_Z = TRUE)

# We can add a species column to the metadata
samples_metadata(MOFAobject) %<>% mutate(species = group)
```

# Factor overview and visualization

## Variance decomposition and factor correlation
To obtain a first overview of the factors we can take a look at the variance that a factor explains in each organ and species.
```{r variance_decomposition}
p <- plot_variance_explained(MOFAobject, plot_total = T)
p[[1]] + theme(axis.text.x = element_text(angle = 90))
p[[2]] + theme(axis.text.x = element_text(angle = 90))
```

## Factors versus developmental time (before/after alignement)
To look at the inferred factors, we can plot them against the developmental stages. Before alignment, the timing of common developmental process does not match, after alignment, we see a clearer match.

```{r aligned_vs_unaligned}
# Factors versus developmental time (before alignment)
plot_factors_vs_cov(MOFAobject, color_by = "species",
                    covariate = "time",  warped = FALSE) 
# Factors versus developmental time (after alignment)
plot_factors_vs_cov(MOFAobject, color_by = "species",
                    covariate = "time", scale = FALSE)
```

## Interpolation
Using the underlying Gaussian process for each factor we can interpolate to unseen time points for species that are missing data in these time points or intermediate time points. If you want to show uncertainties, set only_mean to FALSE. Note that these are only available if the new values for interpolation have been specified before training.
```{r}
plot_interpolation_vs_covariate(MOFAobject, only_mean = FALSE)

# if the interpolation was not specified in mefisto_options before training
# or you want to interpolate to new values you can use interpolate_factors
# this however does not provide uncertainties
MOFAobject_new <- interpolate_factors(MOFAobject,
                                  new_values = seq(1, 14, 0.1))
plot_interpolation_vs_covariate(MOFAobject_new)
```
 
## Factor scatterplot
A scatterplot of the first two factors shows the embedding of timepoint-species combination in the latent space. Again we see, that the aligned times show a gradual development along the developmental trajectory captured by the conserved developmental programs on Factors 1 and 2.
```{r scatter, fig.width=5, fig.height=4}
plot_factors(MOFAobject, 1:2, color_by = "time_warped") 
plot_factors(MOFAobject, 1:2, color_by = "time") 
plot_factors(MOFAobject, 1:2, color_by = "species")
```

# Alignment
Next, we further look into the alignment that was learnt by the model by plotting the mapping of mouse developmental stages to the stages of other species.

```{r aligned}
plot_alignment(MOFAobject)
```

# Smoothness and sharedness of factors: Investigating the learnt hyper parameters

In addition to the factor values and the alignment the model also inferred an underlying Gaussian process that generated these values. By looking into it we can extract information on the smoothness of each factor, i.e. how smoothly it varies along developmental time, as well as the sharedness of each factor, i.e. how much the species (groups) show the same underlying developmental pattern and how the shape of their developmental trajectory related to a given developmental module (Factor) clusters between species.

## Smoothness score
The scale  parameters of the Gaussian process capture the smoothness of the model. A scale of 1 indicates high smoothness, a scale of 0 variation independent of time. Here all factors show a very high degree of smoothness.
```{r}
get_scales(MOFAobject)
```

For visualization we visualize the scale of the Gaussian process as a smoothness score in a barplot for each factor. No color indicates no smoothness, a fully colored bar a very smooth factor.
```{r smooth, fig.height=0.5, fig.width=5}
plot_smoothness(MOFAobject) 
```

## Sharedness
The group kernel of the Gaussian process can give us insights into the extent to which a temporal pattern is shared across species for the developmental module captured by each process. Here, we see a strong clustering of the first processes, indicating that these are conserved in evolution of the species, while the last two processes show distinct clusters.
```{r}
# We can use the following, below we make some more custom plot
plot_group_kernel(MOFAobject, factors = "all",
                  cellheight =  15, cellwidth = 15,
                  treeheight_col = 3, treeheight_row = 3)
```

For visualization we calculate an overall sharedness score per factor between 0 and 1. No color indicates no sharedness, a fully colored bar a fully shared factor.
```{r shared, fig.height=0.5, fig.width=5}
plot_sharedness(MOFAobject) 
```


# Factor weights: Gene set enrichment analysis and inspection of genes with top weights per factor
## Inspection of top genes per factor
To obtain insights into the molecular signatures associated with each factor in each organ, we can inspect the genes that have high weights on the factors as well as carry out an enrichment analysis to obtain gene sets that are associated to the process.

```{r}
plot_top_weights(MOFAobject, view = "Brain", factors = 1)
```

Let's take a look at these genes over the developmental time.

Factor 1 captures genes with a smooth up/downregulation of expression across all species.
```{r}
plot_data_vs_cov(MOFAobject, factor = 1,
                 view = "Brain",
                 features = 3)
```

Factor 3 captures a sharp expression change in testis.
```{r}
plot_data_vs_cov(MOFAobject, factor = 3,
                 view ="Testis",
                 features = 3)
```

For factor 4, we see that human expression trajectories stand out.
```{r}
plot_data_vs_cov(MOFAobject, factor = 4,
                 view ="Liver",
                 features = 3) 
```

## Gene set enrichment analysis

To aggregate information on factors from individual genes to gene sets we can use enrichment analysis on each factor and for each organ by running `run_enrichment`. We illustrate this here for Brain, for other organs you can do this in an analogous manner.

See our MOFA tutorial on enrichment analysis for details.

```{r}
data("reactomeGS", package = "MOFAdata")
reactomeGS <- reactomeGS[!is.na(rownames(reactomeGS)),]

# we rename the feature to match the gene names in the gene sets
# by removing the organ parts
MOFAobject_renamed <- MOFAobject
features_names(MOFAobject_renamed) <- lapply(features_names(MOFAobject_renamed),
                                             function(l) gsub("_.*","", l))

res <- run_enrichment(MOFAobject_renamed,
    view = "Brain", factors = "all",
    feature.sets = reactomeGS,
    statistical.test = "parametric")

# show top 10 pathways for Factor 1
plot_enrichment(res, factor = 1, max.pathways = 10)
```


<details>
  <summary>**Session Info**</summary>
  
```{r}
sessionInfo()
```

</details>