-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathevodevo_tutorial.Rmd
256 lines (203 loc) · 11.2 KB
/
evodevo_tutorial.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
---
title: "MEFISTO application to evodevo data"
author:
- name: "Britta Velten"
affiliation: "German Cancer Research Center, Heidelberg, Germany"
email: "[email protected]"
date: "`r Sys.Date()`"
output:
BiocStyle::html_document:
toc_float: true
vignette: >
%\VignetteIndexEntry{MEFISTO on evodevo data}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
```{r setup, message=FALSE, warning=FALSE}
library(MOFA2)
library(magrittr)
library(tidyverse)
library(cowplot)
library(reshape2)
```
# Load data
The data for this vignette can be downloaded from [here](https://figshare.com/s/242916198fde3353f3e6) and placed in a `data/` directory. You should then have two files:
- `data/evodevo.RData` containing the expression data used as input
- `data/evodevo_model.hdf5` containing the pre-trained MEFISTO model
We first load the dataframe that contains the gene expression data for all organs (views) and species (groups) as well as the (unmatched) time annotation for each sample.
```{r load_data}
load("data/evodevo.RData")
head(evodevo)
```
# Prepare and train MOFA
*Create the MOFA object*
First, we need to create a MOFA object from this data. This step is analogous to use of MOFA without the time information and can be done using `create_mofa`, which results in an untrained MOFA object.
```{r, message=FALSE, warning=FALSE}
# create the MOFA object and add times as covariate
MOFAobject_untrained <- create_mofa(data = evodevo)
MOFAobject_untrained
```
Next, we want to add the time information for each sample, which we can do using `set_covariates`. As the time is already contained in the data.frame that we passed to the MOFAobject in `create_mofa`, we can just specify to use this column as covariate. Alternatively, we could also supply a new matrix or data frame providing the sample names and covariates. See `?set_covariates`.
```{r}
head(samples_metadata(MOFAobject_untrained))
MOFAobject_untrained <- set_covariates(MOFAobject_untrained, covariates = "time")
MOFAobject_untrained
```
**Show data and time covariate**
Before moving towards the training, let's first take a look at the data in the untrained object. The plot shows for species (groups) which organ (view) are available for which samples. The right plots show the time values for each sample.
```{r data_overview, fig.height= 10}
gg_input <- plot_data_overview(MOFAobject_untrained,
show_covariate = TRUE,
show_dimensions = TRUE)
gg_input
```
**Prepare the MOFA object**
Next, we prepare the model for training. For this we specify the different options. Similar to the standard MOFA workflow we define data, model and training options. Take a look at `get_default_data_options`, `get_default_model_options` and `get_default_training_options` to see what options can be specified. Here, we specify that we want to use 5 factors and set a seed for reproducibility.
In addition, for MEFISTO we can specify MEFISTO-specific options, see `get_default_mefisto_options`. Here, we specify that we want to align the times across species using Mouse as a reference group. Furthermore, we can optionally specify time points (when using warping these correspond to times in the reference group), for which to interpolate the factors in all groups.
```{r MOFA_options}
data_opts <- get_default_data_options(MOFAobject_untrained)
model_opts <- get_default_model_options(MOFAobject_untrained)
model_opts$num_factors <- 5
train_opts <- get_default_training_options(MOFAobject_untrained)
train_opts$seed <- 2020
train_opts$convergence_mode <- "medium" # use "fast" for faster training
mefisto_opts <- get_default_mefisto_options(MOFAobject_untrained)
mefisto_opts$warping <- TRUE
mefisto_opts$warping_ref <- "Mouse"
mefisto_opts$new_values <- 1:14
```
We then pass all these options to the object.
```{r prepare_MOFA, message = FALSE}
MOFAobject_untrained <- prepare_mofa(
object = MOFAobject_untrained,
data_options = data_opts,
model_options = model_opts,
training_options = train_opts,
mefisto_options = mefisto_opts
)
```
**Train the MOFA model**
Now, we are ready to train the model.
```{r}
outfile <- "data/evodevo_model.hdf5"
# MOFAobject <- run_mofa(MOFAobject_untrained, outfile = outfile)
```
As this might take ~15 min, for all further analysis we will use a pre-trained model. In addition to the factor values at the observed data points we can optionally load the interpolated data.
```{r load_model, warning = FALSE}
MOFAobject <- load_model(outfile, load_interpol_Z = TRUE)
# We can add a species column to the metadata
samples_metadata(MOFAobject) %<>% mutate(species = group)
```
# Factor overview and visualization
## Variance decomposition and factor correlation
To obtain a first overview of the factors we can take a look at the variance that a factor explains in each organ and species.
```{r variance_decomposition}
p <- plot_variance_explained(MOFAobject, plot_total = T)
p[[1]] + theme(axis.text.x = element_text(angle = 90))
p[[2]] + theme(axis.text.x = element_text(angle = 90))
```
## Factors versus developmental time (before/after alignement)
To look at the inferred factors, we can plot them against the developmental stages. Before alignment, the timing of common developmental process does not match, after alignment, we see a clearer match.
```{r aligned_vs_unaligned}
# Factors versus developmental time (before alignment)
plot_factors_vs_cov(MOFAobject, color_by = "species",
covariate = "time", warped = FALSE)
# Factors versus developmental time (after alignment)
plot_factors_vs_cov(MOFAobject, color_by = "species",
covariate = "time", scale = FALSE)
```
## Interpolation
Using the underlying Gaussian process for each factor we can interpolate to unseen time points for species that are missing data in these time points or intermediate time points. If you want to show uncertainties, set only_mean to FALSE. Note that these are only available if the new values for interpolation have been specified before training.
```{r}
plot_interpolation_vs_covariate(MOFAobject, only_mean = FALSE)
# if the interpolation was not specified in mefisto_options before training
# or you want to interpolate to new values you can use interpolate_factors
# this however does not provide uncertainties
MOFAobject_new <- interpolate_factors(MOFAobject,
new_values = seq(1, 14, 0.1))
plot_interpolation_vs_covariate(MOFAobject_new)
```
## Factor scatterplot
A scatterplot of the first two factors shows the embedding of timepoint-species combination in the latent space. Again we see, that the aligned times show a gradual development along the developmental trajectory captured by the conserved developmental programs on Factors 1 and 2.
```{r scatter, fig.width=5, fig.height=4}
plot_factors(MOFAobject, 1:2, color_by = "time_warped")
plot_factors(MOFAobject, 1:2, color_by = "time")
plot_factors(MOFAobject, 1:2, color_by = "species")
```
# Alignment
Next, we further look into the alignment that was learnt by the model by plotting the mapping of mouse developmental stages to the stages of other species.
```{r aligned}
plot_alignment(MOFAobject)
```
# Smoothness and sharedness of factors: Investigating the learnt hyper parameters
In addition to the factor values and the alignment the model also inferred an underlying Gaussian process that generated these values. By looking into it we can extract information on the smoothness of each factor, i.e. how smoothly it varies along developmental time, as well as the sharedness of each factor, i.e. how much the species (groups) show the same underlying developmental pattern and how the shape of their developmental trajectory related to a given developmental module (Factor) clusters between species.
## Smoothness score
The scale parameters of the Gaussian process capture the smoothness of the model. A scale of 1 indicates high smoothness, a scale of 0 variation independent of time. Here all factors show a very high degree of smoothness.
```{r}
get_scales(MOFAobject)
```
For visualization we visualize the scale of the Gaussian process as a smoothness score in a barplot for each factor. No color indicates no smoothness, a fully colored bar a very smooth factor.
```{r smooth, fig.height=0.5, fig.width=5}
plot_smoothness(MOFAobject)
```
## Sharedness
The group kernel of the Gaussian process can give us insights into the extent to which a temporal pattern is shared across species for the developmental module captured by each process. Here, we see a strong clustering of the first processes, indicating that these are conserved in evolution of the species, while the last two processes show distinct clusters.
```{r}
# We can use the following, below we make some more custom plot
plot_group_kernel(MOFAobject, factors = "all",
cellheight = 15, cellwidth = 15,
treeheight_col = 3, treeheight_row = 3)
```
For visualization we calculate an overall sharedness score per factor between 0 and 1. No color indicates no sharedness, a fully colored bar a fully shared factor.
```{r shared, fig.height=0.5, fig.width=5}
plot_sharedness(MOFAobject)
```
# Factor weights: Gene set enrichment analysis and inspection of genes with top weights per factor
## Inspection of top genes per factor
To obtain insights into the molecular signatures associated with each factor in each organ, we can inspect the genes that have high weights on the factors as well as carry out an enrichment analysis to obtain gene sets that are associated to the process.
```{r}
plot_top_weights(MOFAobject, view = "Brain", factors = 1)
```
Let's take a look at these genes over the developmental time.
Factor 1 captures genes with a smooth up/downregulation of expression across all species.
```{r}
plot_data_vs_cov(MOFAobject, factor = 1,
view = "Brain",
features = 3)
```
Factor 3 captures a sharp expression change in testis.
```{r}
plot_data_vs_cov(MOFAobject, factor = 3,
view ="Testis",
features = 3)
```
For factor 4, we see that human expression trajectories stand out.
```{r}
plot_data_vs_cov(MOFAobject, factor = 4,
view ="Liver",
features = 3)
```
## Gene set enrichment analysis
To aggregate information on factors from individual genes to gene sets we can use enrichment analysis on each factor and for each organ by running `run_enrichment`. We illustrate this here for Brain, for other organs you can do this in an analogous manner.
See our MOFA tutorial on enrichment analysis for details.
```{r}
data("reactomeGS", package = "MOFAdata")
reactomeGS <- reactomeGS[!is.na(rownames(reactomeGS)),]
# we rename the feature to match the gene names in the gene sets
# by removing the organ parts
MOFAobject_renamed <- MOFAobject
features_names(MOFAobject_renamed) <- lapply(features_names(MOFAobject_renamed),
function(l) gsub("_.*","", l))
res <- run_enrichment(MOFAobject_renamed,
view = "Brain", factors = "all",
feature.sets = reactomeGS,
statistical.test = "parametric")
# show top 10 pathways for Factor 1
plot_enrichment(res, factor = 1, max.pathways = 10)
```
<details>
<summary>**Session Info**</summary>
```{r}
sessionInfo()
```
</details>