Skip to content

Commit

Permalink
Merge pull request #14 from wariobrega/main
Browse files Browse the repository at this point in the history
added TF enrichment to the vignette + small text edits on the same vignette
  • Loading branch information
wariobrega authored Sep 6, 2024
2 parents 3e81a81 + e718cfe commit bd5abc4
Showing 1 changed file with 201 additions and 5 deletions.
206 changes: 201 additions & 5 deletions docs/source/vignettes/TF enrichment.ipynb
Original file line number Diff line number Diff line change
Expand Up @@ -11,15 +11,17 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"This notebook sums up a few guidelines to \n",
"This notebook sums up guidelines to: \n",
"\n",
"* extract the top features (e.g., genes or peaks) for all dimensions in the Mowgli embedding.\n",
"* perform a motif enrichment analysis (using peaks) (as done for Figure 5 of our [manuscript](https://www.nature.com/articles/s41467-023-43019-2)).\n",
"* perform a TF enrichment analysis (using genes) using a collection of TF->Targets.\n",
"\n",
"The code was used in the [original publication](https://doi.org/10.1038/s41467-023-43019-2) to perform motif enrichment analysis from chromatin accessibility (Figure 5-C). This notebook is a summarisation of the code that is stored in our [Mowgli reproducibility](https://github.com/cantinilab/mowgli_reproducibility) repository.\n",
"**NOTE #1:** This notebook uses both R and Python code. We recommend to copy paste in your local Jupyter or Rstudio session and run the code in the corresponding language. \n",
"\n",
"**NOTE:** This notebook uses both R and Python code. We recommend to copy paste in your local Jupyter or Rstudio session and run the code in the corresponding language. \n",
"**NOTE #2:** The enrichment are performed on human data. Change this code and the databases accordingly if you are working with other species. "
"**NOTE #2:** The enrichments are performed on human data. Change this code and the databases accordingly if you are working with other species. \n",
"\n",
"**NOTE #3:** It is possible also to match the TF and motif enrichment to get a better viee of the relationship between the transcriptional and epigenetic landscape. We do not cover here this analysis since it's very case-specific "
]
},
{
Expand Down Expand Up @@ -133,7 +135,9 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## Motif enrichment "
"## Motif enrichment \n",
"\n",
"This code was used in the [original publication](https://doi.org/10.1038/s41467-023-43019-2) to perform motif enrichment analysis from chromatin accessibility (Figure 5-C). This notebook is a summarisation of the code that is stored in our [Mowgli reproducibility](https://github.com/cantinilab/mowgli_reproducibility) repository."
]
},
{
Expand Down Expand Up @@ -320,6 +324,198 @@
" write.csv(enriched_motifs, paste0(out_motif, dim, \".csv\"))\n",
"}"
]
},
{
"cell_type": "markdown",
"metadata": {
"vscode": {
"languageId": "r"
}
},
"source": [
"## TF Enrichment using top genes in mowgli dimensions"
]
},
{
"cell_type": "markdown",
"metadata": {
"vscode": {
"languageId": "r"
}
},
"source": [
"We perform here a standard TF enrichment using the top features identifuied in the RNA space for each dimension of Mowgli.\n",
"\n",
"In this case example, we made use of the [Regulatory Circuits](https://doi.org/10.1038/nmeth.3799) database ([link](http://www2.unil.ch/cbg/regulatorycircuits/FANTOM5_individual_networks.tar)), but we recommend the users to choose the most appropriate TF->Target database according to his domain and prior biological information.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"vscode": {
"languageId": "r"
}
},
"outputs": [],
"source": [
"# Reload libraries\n",
"\n",
"library(stats)\n",
"library(dplyr)\n",
"library(stringr)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"vscode": {
"languageId": "r"
}
},
"outputs": [],
"source": [
"# R \n",
"\n",
"# reload GRN and make a TF-> Target list\n",
"\n",
"# network of epithelial cells\n",
"grn.path <- \"Regulatory_circuits_mammary_epithelial_cell.txt\"\n",
"grn <- read.table(grn.path, sep=\"\\t\", header = F)\n",
"colnames(grn) <- c(\"TF\", \"Target\", \"score\")\n",
"\n",
"# make a TF -> Target list\n",
"tf_list <- split(grn$Target, grn$TF)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"vscode": {
"languageId": "r"
}
},
"outputs": [],
"source": [
"# R\n",
"\n",
"# reload top features for RNA and reparse it\n",
"top_feats.path <- \"top_genes_in_mowgli.csv\"\n",
"top_feats <- read.table(top_feats.path, sep = \",\", header = T)\n",
"# set row names to index\n",
"row.names(top_feats) <- top_feats$hgnc_symbol\n",
"\n",
"cols_to_keep <- c(\"highly_variable\", grep(\"top_in_dim\", names(top_feats), value = TRUE))\n",
"\n",
"top_feats.filtered <- top_feats %>%\n",
" select(all_of(cols_to_keep))\n",
"\n",
"top_feats.filtered <- top_feats.filtered %>%\n",
" mutate(\n",
" `highly_variable` = as.logical(ifelse(`highly_variable` == \"True\", TRUE, FALSE)),\n",
" across(starts_with(\"top_in_dim\"), ~ as.logical(ifelse(. == \"True\", TRUE, FALSE)))\n",
" )\n",
"\n",
"# define the universe -> in this case, a list of highly variable genes in the RNA slot\n",
"universe <- readLines(\"highly_variable_genes.txt\")\n",
"\n",
"universe.len <- length(universe)\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"vscode": {
"languageId": "r"
}
},
"source": [
"In brief:\n",
"- we loop through each dimension and we select for each dimension the top features\n",
"- we loop through each TF (using a groupby) and we identify the top sets of features\n",
"- we make a hypergeometric test \n",
"- we calculate an enrichment score enrichment\n",
"- we correct the pvalue using Benjamini-hochberg correction\n",
"- we write the results to a file (only for significant TFs enriched)\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"vscode": {
"languageId": "r"
}
},
"outputs": [],
"source": [
"# R\n",
"\n",
"# directory that will store the results\n",
"res.dir <- \"tf_enrichment\"\n",
"\n",
"# colnames of the dataframe storing the results\n",
"res.colnames <- c(\"TF\", \"number_of_targets\", \"enrichment_score\", \"p.val\", \"p.adjust\")\n",
"\n",
"# loop through all the top mowgli dimensions\n",
"for (dim in names(top_feats.filtered)[grepl(\"^top_in\", names(top_feats.filtered))]) {\n",
" dim_number <- str_extract(dim, \"\\\\d+$\")\n",
" print(paste(\"enriching:\", dim_number))\n",
" \n",
" # select the top features for that dimensiob\n",
" top_genes <- rownames(top_feats.filtered[top_feats.filtered[[dim]] == TRUE, ])\n",
" top_genes.len <- length(top_genes) # should always be 100\n",
" ratioDim <- top_genes.len / universe.len # number of genes in the universe, shoukd always be the same numbe, useful for the enrichment\n",
"\n",
" # open the output file\n",
" output_file_name <- file.path(res.dir, paste0(\"enriched_TFs_dim\", dim_number, \".tsv\"))\n",
" res.df <- data.frame(matrix(ncol= length(res.colnames), nrow = 0))\n",
" colnames(res.df) <- res.colnames\n",
" \n",
" # define the list to store files \n",
" p_values <- numeric()\n",
" tfs <- character()\n",
" n_targets <- numeric()\n",
" enrichment_scores <- numeric()\n",
"\n",
" # loop through all the TFs and perform the enrichment\n",
" for (tf in names(tf_list)){\n",
" targets <- tf_list[[tf]]\n",
" \n",
" x <- length(intersect(targets, top_genes)) # white balls, i.e. how many genes in top dim are in the TF targets\n",
" m <- length(intersect(universe, targets)) # number of white balls in the urn, i.e., how many targets are in the universe\n",
" n <- universe.len - length(intersect(targets, universe)) # number of black balls in the urn, i.e. how many genes in the universe are NOT in the targets\n",
" k <- top_genes.len # the size of the balls drawn, always 100 (the number of top genes in the dimension)\n",
" \n",
" p_value <- phyper(x, m, n, k, lower.tail = FALSE) #select as significantonly the over enriched\n",
" n_targets_expressed <- length(intersect(targets, rownames(universe)))\n",
" n_targets_feats <- x/universe.len\n",
" ratioTargets <- ifelse(n_targets_expressed == 0, 0, x / n_targets_expressed)\n",
" enrichment_score <- 1/ (ratioTargets / ratioDim)\n",
" \n",
" # Store results for FDR adjustment\n",
" p_values <- c(p_values, p_value)\n",
" tfs <- c(tfs, tf)\n",
" n_targets <- c(n_targets, x)\n",
" enrichment_scores <- c(enrichment_scores, enrichment_score)\n",
" }\n",
"\n",
" # Adjust p-values using Benjamini-Hochberg correction\n",
" adjusted_p_values <- p.adjust(p_values, method = \"BH\")\n",
"\n",
" # Combine results into a dataframe\n",
" results <- data.frame(TF = tfs, number_of_targets = n_targets, enrichment_score = enrichment_scores, p_value = p_values, adjusted_p_value = adjusted_p_values)\n",
" \n",
" # Filter results for significant adjusted p-values\n",
" significant_results <- results[results$adjusted_p_value <= 0.05, ]\n",
" \n",
" # Save significant results to file\n",
" output_file_name <- file.path(res.dir, paste0(\"enriched_TFs_dim\", dim_number, \".tsv\"))\n",
" write.table(significant_results, file = output_file_name, sep = \"\\t\", row.names = FALSE, col.names = TRUE, quote = FALSE)\n",
"}"
]
}
],
"metadata": {
Expand Down

0 comments on commit bd5abc4

Please sign in to comment.