Merge pull request #14 from wariobrega/main

added TF enrichment to the vignette + small text edits on the same vignette
cantinilab · Sep 6, 2024 · bd5abc4 · bd5abc4
2 parents 3e81a81 + e718cfe
commit bd5abc4
Showing 1 changed file with 201 additions and 5 deletions.
diff --git a/docs/source/vignettes/TF enrichment.ipynb b/docs/source/vignettes/TF enrichment.ipynb
@@ -11,15 +11,17 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "This notebook sums up a few guidelines to \n",
+    "This notebook sums up guidelines to: \n",
     "\n",
     "* extract the top features (e.g., genes or peaks) for all dimensions in the Mowgli embedding.\n",
     "* perform a motif enrichment analysis (using peaks) (as done for Figure 5 of our [manuscript](https://www.nature.com/articles/s41467-023-43019-2)).\n",
+    "* perform a TF enrichment analysis (using genes) using a collection of TF->Targets.\n",
     "\n",
-    "The  code was used in the [original publication](https://doi.org/10.1038/s41467-023-43019-2) to perform motif enrichment analysis from chromatin accessibility (Figure 5-C).  This notebook is a summarisation of the code that is stored in our [Mowgli reproducibility](https://github.com/cantinilab/mowgli_reproducibility) repository.\n",
+    "**NOTE #1:** This notebook uses both R and Python code. We recommend to copy paste in your local Jupyter or Rstudio session and run the code in the corresponding language. \n",
     "\n",
-    "**NOTE:** This notebook uses both R and Python code. We recommend to copy paste in your local Jupyter or Rstudio session and run the code in the corresponding language. \n",
-    "**NOTE #2:** The enrichment are performed on human data. Change this code and the databases accordingly if you are working with other species. "
+    "**NOTE #2:** The enrichments are performed on human data. Change this code and the databases accordingly if you are working with other species. \n",
+    "\n",
+    "**NOTE #3:** It is possible also to match the TF and motif enrichment to get a better viee of the relationship between the transcriptional and epigenetic landscape. We do not cover here this analysis since it's very case-specific  "
    ]
   },
   {
@@ -133,7 +135,9 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "## Motif enrichment "
+    "## Motif enrichment \n",
+    "\n",
+    "This code was used in the [original publication](https://doi.org/10.1038/s41467-023-43019-2) to perform motif enrichment analysis from chromatin accessibility (Figure 5-C).  This notebook is a summarisation of the code that is stored in our [Mowgli reproducibility](https://github.com/cantinilab/mowgli_reproducibility) repository."
    ]
   },
   {
@@ -320,6 +324,198 @@
     "    write.csv(enriched_motifs, paste0(out_motif, dim, \".csv\"))\n",
     "}"
    ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "vscode": {
+     "languageId": "r"
+    }
+   },
+   "source": [
+    "## TF Enrichment using top genes in mowgli dimensions"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "vscode": {
+     "languageId": "r"
+    }
+   },
+   "source": [
+    "We perform here a standard TF enrichment using the top features identifuied in the RNA space for each dimension of Mowgli.\n",
+    "\n",
+    "In this case example, we made use of the [Regulatory Circuits](https://doi.org/10.1038/nmeth.3799) database ([link](http://www2.unil.ch/cbg/regulatorycircuits/FANTOM5_individual_networks.tar)), but we recommend the users to choose the most appropriate TF->Target database according to his domain and prior biological information.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "vscode": {
+     "languageId": "r"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "# Reload libraries\n",
+    "\n",
+    "library(stats)\n",
+    "library(dplyr)\n",
+    "library(stringr)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "vscode": {
+     "languageId": "r"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "# R \n",
+    "\n",
+    "# reload GRN and make a TF-> Target list\n",
+    "\n",
+    "# network of epithelial cells\n",
+    "grn.path <- \"Regulatory_circuits_mammary_epithelial_cell.txt\"\n",
+    "grn <- read.table(grn.path, sep=\"\\t\", header = F)\n",
+    "colnames(grn) <- c(\"TF\", \"Target\", \"score\")\n",
+    "\n",
+    "# make a TF -> Target list\n",
+    "tf_list <- split(grn$Target, grn$TF)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "vscode": {
+     "languageId": "r"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "# R\n",
+    "\n",
+    "# reload top features for RNA and reparse it\n",
+    "top_feats.path <- \"top_genes_in_mowgli.csv\"\n",
+    "top_feats <- read.table(top_feats.path, sep = \",\", header = T)\n",
+    "# set row names to index\n",
+    "row.names(top_feats) <- top_feats$hgnc_symbol\n",
+    "\n",
+    "cols_to_keep <- c(\"highly_variable\", grep(\"top_in_dim\", names(top_feats), value = TRUE))\n",
+    "\n",
+    "top_feats.filtered <- top_feats %>%\n",
+    "  select(all_of(cols_to_keep))\n",
+    "\n",
+    "top_feats.filtered <- top_feats.filtered %>%\n",
+    "  mutate(\n",
+    "    `highly_variable` = as.logical(ifelse(`highly_variable` == \"True\", TRUE, FALSE)),\n",
+    "    across(starts_with(\"top_in_dim\"), ~ as.logical(ifelse(. == \"True\", TRUE, FALSE)))\n",
+    "  )\n",
+    "\n",
+    "# define the universe -> in this case, a list of highly variable genes in the RNA slot\n",
+    "universe <- readLines(\"highly_variable_genes.txt\")\n",
+    "\n",
+    "universe.len <- length(universe)\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "vscode": {
+     "languageId": "r"
+    }
+   },
+   "source": [
+    "In brief:\n",
+    "- we loop through each dimension and we select for each dimension the top features\n",
+    "- we loop through each TF (using a groupby) and we identify the top sets of features\n",
+    "- we make a hypergeometric test \n",
+    "- we calculate an enrichment score enrichment\n",
+    "- we correct the pvalue using Benjamini-hochberg correction\n",
+    "- we write the results to a file (only for significant TFs enriched)\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "vscode": {
+     "languageId": "r"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "# R\n",
+    "\n",
+    "# directory that will store the results\n",
+    "res.dir <- \"tf_enrichment\"\n",
+    "\n",
+    "# colnames of the dataframe storing the results\n",
+    "res.colnames <- c(\"TF\", \"number_of_targets\", \"enrichment_score\", \"p.val\", \"p.adjust\")\n",
+    "\n",
+    "# loop through all the top mowgli dimensions\n",
+    "for (dim in names(top_feats.filtered)[grepl(\"^top_in\", names(top_feats.filtered))]) {\n",
+    "    dim_number <- str_extract(dim, \"\\\\d+$\")\n",
+    "    print(paste(\"enriching:\", dim_number))\n",
+    "    \n",
+    "    # select the top features for that dimensiob\n",
+    "    top_genes <- rownames(top_feats.filtered[top_feats.filtered[[dim]] == TRUE, ])\n",
+    "    top_genes.len <- length(top_genes) # should always be 100\n",
+    "    ratioDim <- top_genes.len / universe.len # number of genes in the universe, shoukd always be the same numbe, useful for the enrichment\n",
+    "\n",
+    "    # open the output file\n",
+    "    output_file_name <- file.path(res.dir, paste0(\"enriched_TFs_dim\", dim_number, \".tsv\"))\n",
+    "    res.df <- data.frame(matrix(ncol= length(res.colnames), nrow = 0))\n",
+    "    colnames(res.df) <- res.colnames\n",
+    "   \n",
+    "    # define the list to store files \n",
+    "    p_values <- numeric()\n",
+    "    tfs <- character()\n",
+    "    n_targets <- numeric()\n",
+    "    enrichment_scores <- numeric()\n",
+    "\n",
+    "    # loop through all the TFs and perform the enrichment\n",
+    "    for (tf in names(tf_list)){\n",
+    "        targets <- tf_list[[tf]]\n",
+    "        \n",
+    "        x <- length(intersect(targets, top_genes)) # white balls, i.e. how many genes in top dim are in the TF targets\n",
+    "        m <- length(intersect(universe, targets)) # number of white balls in the urn, i.e., how many targets are in the universe\n",
+    "        n <- universe.len - length(intersect(targets, universe)) # number of black balls in the urn, i.e. how many genes in the universe are NOT in the targets\n",
+    "        k <- top_genes.len # the size of the balls drawn, always 100 (the number of top genes in the dimension)\n",
+    "        \n",
+    "        p_value <- phyper(x, m, n, k, lower.tail = FALSE) #select as significantonly the over enriched\n",
+    "        n_targets_expressed <- length(intersect(targets, rownames(universe)))\n",
+    "        n_targets_feats <- x/universe.len\n",
+    "        ratioTargets <- ifelse(n_targets_expressed == 0, 0, x / n_targets_expressed)\n",
+    "        enrichment_score <- 1/ (ratioTargets / ratioDim)\n",
+    "        \n",
+    "        # Store results for FDR adjustment\n",
+    "        p_values <- c(p_values, p_value)\n",
+    "        tfs <- c(tfs, tf)\n",
+    "        n_targets <- c(n_targets, x)\n",
+    "        enrichment_scores <- c(enrichment_scores, enrichment_score)\n",
+    "        }\n",
+    "\n",
+    "    # Adjust p-values using Benjamini-Hochberg correction\n",
+    "    adjusted_p_values <- p.adjust(p_values, method = \"BH\")\n",
+    "\n",
+    "    # Combine results into a dataframe\n",
+    "    results <- data.frame(TF = tfs, number_of_targets = n_targets, enrichment_score = enrichment_scores, p_value = p_values, adjusted_p_value = adjusted_p_values)\n",
+    "    \n",
+    "    # Filter results for significant adjusted p-values\n",
+    "    significant_results <- results[results$adjusted_p_value <= 0.05, ]\n",
+    "    \n",
+    "    # Save significant results to file\n",
+    "    output_file_name <- file.path(res.dir, paste0(\"enriched_TFs_dim\", dim_number, \".tsv\"))\n",
+    "    write.table(significant_results, file = output_file_name, sep = \"\\t\", row.names = FALSE, col.names = TRUE, quote = FALSE)\n",
+    "}"
+   ]
   }
  ],
  "metadata": {