---
title: "Notes on data pre-processing for the use of MEFISTO"
author:
- name: "Britta Velten"
  affiliation: "German Cancer Research Center, Heidelberg, Germany"
  email: "b.velten@dkfz-heidelberg.de"
date: "`r Sys.Date()`"
output:
  BiocStyle::html_document:
    toc_float: true
vignette: >
  %\VignetteIndexEntry{MEFISTO preprocessing}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

# Introduction
In general, MEFISTO can be used with a different likelihood model for each view, depending on the nature of the data modality: Gaussian (for continuous data), Poisson (for count data) and Bernoulli (for binary data), in the same manner as MOFA2. The implementation of the non-Gaussian likelihood models relies on Gaussian approximations to enable fast variational inference in non-conjugate models, following [Seeger & Bouchard](http://proceedings.mlr.press/v22/seeger12.html).

In many applications, however, data-specific preprocessing is preferable, as it takes the characteristics of each view into account and makes the use of the Gaussian likelihood model (and its underlying homoscedasticity assumption) appropriate. In particular, for sequencing count data, data-specific preprocessing should be applied in most cases to correct for technical factors, such as library size, and to remove mean-variance relationships in the data. Here we provide an overview of some possible approaches for this task, but more recent proposals or normalization methods tailored to the data type at hand might exist and could be useful instead.

# General pre-processing
The examples below illustrate some generally useful pre-processing steps for count data before applying MOFA2 or MEFISTO.

## Example 1: Bulk RNA-seq data 
Library size correction and variance stabilization are common steps to prepare count data for computational methods that are based on a normality assumption. For this purpose, [DESeq2](http://www.bioconductor.org/packages/release/bioc/html/DESeq2.html) can be used with one of the following transformations (see also the [corresponding tutorial for details](http://www.bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#count-data-transformations)).

First, we do a library size correction.
```{r, message=FALSE, warning=FALSE}
library(DESeq2)
dds <- makeExampleDESeqDataSet()
dds <- estimateSizeFactors(dds)
```
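
To make the idea of a library size correction concrete, the following is a minimal base R sketch of a median-of-ratios size factor, the kind of estimate that `estimateSizeFactors` computes by default. The toy matrix `toy_counts` and the helper `median_of_ratios` are hypothetical names introduced for illustration only; in practice the DESeq2 function above should be used.

```{r, message=FALSE, warning=FALSE}
# Toy count matrix (hypothetical values): genes in rows, samples in columns.
# Sample s2 is sequenced exactly twice as deeply as sample s1.
toy_counts <- cbind(s1 = c(10, 20, 30, 40), s2 = c(20, 40, 60, 80))
rownames(toy_counts) <- paste0("gene", 1:4)

# Median-of-ratios size factors: each sample is compared to a pseudo-reference
# given by the per-gene geometric mean across samples.
median_of_ratios <- function(counts) {
  log_counts <- log(counts)
  log_geo_means <- rowMeans(log_counts)
  # genes with a zero count in any sample have a non-finite log mean and are skipped
  keep <- is.finite(log_geo_means)
  apply(log_counts[keep, , drop = FALSE], 2, function(lc) {
    exp(median(lc - log_geo_means[keep]))
  })
}

size_factors <- median_of_ratios(toy_counts)
# dividing each column by its size factor puts the samples on a common scale
normalized_counts <- sweep(toy_counts, 2, size_factors, "/")
```

In this toy example the size factor of `s2` is twice that of `s1`, so both samples have identical values after the correction.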

Second, we remove the mean-variance trend that is present in the data by one of the following procedures:

* Variance stabilization using `varianceStabilizingTransformation` 
```{r, message=FALSE, warning=FALSE}
# option 1: variance stabilizing transformation
vsd <- varianceStabilizingTransformation(dds)
mefisto_input <- assay(vsd)
```

* Regularized logarithmic transformation using `rlog`
```{r, message=FALSE, warning=FALSE}
# option 2: 'regularized log' transformation
rld <- rlog(dds)
mefisto_input <- assay(rld)
```
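
Whether a transformation has removed the mean-variance trend can be checked with a simple diagnostic. The sketch below uses simulated negative binomial counts (all simulation settings are assumptions chosen for illustration, with fairly high counts so that a plain logarithm is approximately variance stabilizing) and a hypothetical helper `mean_variance_trend` that reports the rank correlation between the per-feature mean and variance.

```{r, message=FALSE, warning=FALSE}
set.seed(1)
n_features <- 200
n_samples <- 50

# hypothetical per-feature means spanning more than one order of magnitude
feature_means <- exp(runif(n_features, min = log(100), max = log(5000)))

# negative binomial counts with a common dispersion; features in rows
sim_counts <- matrix(
  rnbinom(n_features * n_samples,
          mu = rep(feature_means, times = n_samples), size = 10),
  nrow = n_features
)

# Spearman correlation between per-feature mean and per-feature variance:
# values close to 1 indicate a strong mean-variance trend,
# values close to 0 indicate that little trend is left
mean_variance_trend <- function(x) {
  cor(rowMeans(x), apply(x, 1, var), method = "spearman")
}

trend_raw <- mean_variance_trend(sim_counts)
trend_log <- mean_variance_trend(log2(sim_counts + 1))
```

The same helper could be applied to `assay(vsd)` or `assay(rld)` from the chunks above to inspect the transformed data before passing them to MEFISTO.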


## Example 2: Sparse single cell count data
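* The simplest (and hence widely applied) pre-processing uses total sum scaling for size factor correction followed by a shifted logarithmic transformation, $x_{ij} = \log(K_{ij}/\sum_{j'} K_{ij'} + a)$, where $K_{ij}$ is the count of feature $j$ in cell $i$ and $a$ is a pseudocount. This is, for example, the default procedure in [Seurat](https://satijalab.org/seurat/index.html), which additionally multiplies the scaled counts by a scale factor before taking the logarithm.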
```{r, message=FALSE, warning=FALSE}
# see ?NormalizeData; the call returns a Seurat object that stores the normalized values
library(Seurat)
data("pbmc_small")
mefisto_input <- NormalizeData(pbmc_small, normalization.method = "LogNormalize", scale.factor = 10000)
```

For Python users, similar pre-processing steps are implemented for example in [`scanpy`](https://scanpy-tutorials.readthedocs.io/en/latest/index.html).
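
The total sum scaling and shifted logarithm described above can also be written out in a few lines of base R. This is a minimal sketch on a hypothetical toy matrix, with a scale factor of 10000 and a pseudocount of 1 as assumed defaults; the function name `shifted_log` is introduced here for illustration and is not part of any package.

```{r, message=FALSE, warning=FALSE}
# Toy count matrix (hypothetical values): features in rows, cells in columns
sc_counts <- matrix(
  c(0, 3, 7,
    2, 0, 8,
    5, 5, 0),
  nrow = 3,
  dimnames = list(paste0("gene", 1:3), paste0("cell", 1:3))
)

shifted_log <- function(counts, scale_factor = 10000, pseudocount = 1) {
  library_size <- colSums(counts)                  # total count per cell
  scaled <- sweep(counts, 2, library_size, "/") * scale_factor
  log(scaled + pseudocount)                        # zeros stay at log(1) = 0
}

log_normalized <- shifted_log(sc_counts)
```

Zero counts are mapped to zero, which keeps the sparsity pattern of the data visible after the transformation.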

* Pearson residuals from regularized negative binomial regression as proposed in [sctransform](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1874-1)
```{r, message=FALSE, warning=FALSE, eval = FALSE}
# see ?SCTransform; the call returns a Seurat object that stores the residuals
library(Seurat)
data("pbmc_small")
mefisto_input <- SCTransform(object = pbmc_small)
```

* Deviance or Pearson residuals under a multinomial model as proposed by [Townes et al.](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1861-6)
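
As a rough illustration of the last option, Pearson residuals can be sketched in base R under the assumption that each feature count is approximately binomial given the total count of its cell, with a feature probability pooled across all cells. The helper `multinomial_pearson_residuals` and the toy data are hypothetical, and the sketch omits deviance residuals; for real analyses a dedicated implementation is preferable.

```{r, message=FALSE, warning=FALSE}
# Toy count matrix (hypothetical values): features in rows, cells in columns
umi_counts <- cbind(cell1 = c(8, 2, 10), cell2 = c(2, 8, 10), cell3 = c(5, 5, 30))
rownames(umi_counts) <- paste0("gene", 1:3)

multinomial_pearson_residuals <- function(counts) {
  cell_totals <- colSums(counts)                  # total count per cell
  feature_props <- rowSums(counts) / sum(counts)  # pooled relative abundance per feature
  expected <- outer(feature_props, cell_totals)   # expected count: proportion * cell total
  # binomial variance n * p * (1 - p), written in terms of the expected count
  variance <- expected - sweep(expected^2, 2, cell_totals, "/")
  (counts - expected) / sqrt(variance)
}

pearson_resid <- multinomial_pearson_residuals(umi_counts)
```

Features that are never observed have an expected count and a variance of zero and should be removed before applying such a transformation.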


## Example 3: Microbiome data
In principle, similar pre-processing can be used as for RNA-seq data. A detailed tutorial can be found [here](https://bioconductor.org/help/course-materials/2017/BioC2017/Day1/Workshops/Microbiome/MicrobiomeWorkflowII.html). Alternatively, a centered log-ratio transformation can be applied, as implemented for example in [`gemelli`](https://github.com/biocore/gemelli).
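
A centered log-ratio transformation subtracts, for each sample, the mean log abundance from the log abundance of every taxon. The following base R sketch assumes that zeros are handled by adding a pseudocount of 1, which is an assumption made here for simplicity; robust variants of the transformation treat zeros differently and are not reproduced by this sketch. The names `toy_abundance` and `clr_transform` are hypothetical.

```{r, message=FALSE, warning=FALSE}
# Toy abundance table (hypothetical values): taxa in rows, samples in columns
toy_abundance <- cbind(
  sample1 = c(120, 30, 0, 50),
  sample2 = c(10, 400, 25, 0),
  sample3 = c(60, 60, 60, 60)
)
rownames(toy_abundance) <- paste0("taxon", 1:4)

clr_transform <- function(counts, pseudocount = 1) {
  log_counts <- log(counts + pseudocount)
  # centre each sample (column) by its mean log abundance
  sweep(log_counts, 2, colMeans(log_counts), "-")
}

clr_abundance <- clr_transform(toy_abundance)
```

After the transformation the values of each sample sum to zero, so a sample with a perfectly even composition, such as `sample3`, is mapped to zeros.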


# Additional optional steps in the preprocessing
Once the mean-variance relationship has been removed from the data, further filtering steps can be useful, such as

* filtering to the top N most variable features (can reduce memory requirements and computation time)
* filtering to temporally or spatially variable features (e.g. using [maSigPro](https://www.bioconductor.org/packages/release/bioc/html/maSigPro.html) or [spatialDE](https://github.com/Teichlab/SpatialDE)) (useful if mainly smooth sources of variation or alignment are of interest)
* removal of known covariates, e.g. by using the residuals from a regression model as input to MEFISTO
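
Two of these steps, filtering to the most variable features and regressing out a known covariate, can be sketched in base R as follows. The simulated matrix `expr`, the covariate `batch` and the choice of 20 features are assumptions made for illustration; the same pattern would apply to any normalized feature-by-sample matrix.

```{r, message=FALSE, warning=FALSE}
set.seed(1)
n_feat <- 100
n_samp <- 30

# hypothetical known covariate, e.g. a batch indicator per sample
batch <- rep(c(0, 1), length.out = n_samp)

# simulated normalized data: features in rows, samples in columns
expr <- matrix(
  rnorm(n_feat * n_samp), nrow = n_feat,
  dimnames = list(paste0("feature", seq_len(n_feat)), paste0("sample", seq_len(n_samp)))
)
# the first 10 features are shifted in one batch
expr[1:10, ] <- expr[1:10, ] + outer(rep(3, 10), batch)

# 1) keep the top N most variable features
top_n <- 20
feature_var <- apply(expr, 1, var)
top_features <- names(sort(feature_var, decreasing = TRUE))[seq_len(top_n)]
expr_top <- expr[top_features, , drop = FALSE]

# 2) regress out the covariate: with a matrix response, lm() fits one
#    regression per feature; the residuals are transposed back to
#    features in rows and samples in columns
fit <- lm(t(expr_top) ~ batch)
expr_resid <- t(residuals(fit))
```

The residual matrix `expr_resid` is uncorrelated with `batch` by construction and could be used as input in place of the original values.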

# Detailed use case
Detailed examples of the pre-processing used in the MEFISTO applications of the manuscript can be found here:

* Log-transformation for spatial transcriptomics data is integrated in [our R tutorial on spatial transcriptomics data](https://raw.githack.com/bioFAM/MEFISTO_tutorials/master/MEFISTO_ST.html) or [in the corresponding Python tutorial](https://github.com/bioFAM/MEFISTO_tutorials/blob/master/MEFISTO_ST.ipynb)
* [Variance stabilization for RNA-seq data in the evodevo atlas](https://github.com/bioFAM/MEFISTO_analyses/blob/master/evodevo/preprocess_data.R)
* [RCLR transformation for microbiome data](https://github.com/bioFAM/MEFISTO_analyses/blob/master/microbiome/preprocess_data.py)

# SessionInfo
```{r}
sessionInfo()
```