diff --git a/.DS_Store b/.DS_Store index 1228842c..b51205a3 100644 Binary files a/.DS_Store and b/.DS_Store differ diff --git a/R/DoubletsScores.R b/R/DoubletsScores.R index b427b529..22ef2682 100644 --- a/R/DoubletsScores.R +++ b/R/DoubletsScores.R @@ -220,7 +220,7 @@ addDoubletScores <- function( uwotUmap = uwotUmap, knnMethod = knnMethod, seed = 1, - threads = threads + threads = subThreads ) ################################################# diff --git a/data/.DS_Store b/data/.DS_Store new file mode 100644 index 00000000..5008ddfc Binary files /dev/null and b/data/.DS_Store differ diff --git a/data/Cell-Surface-Genes.rds b/data/Cell-Surface-Genes.rds deleted file mode 100644 index 8ac907af..00000000 Binary files a/data/Cell-Surface-Genes.rds and /dev/null differ diff --git a/vignettes/Articles/tutorial.Rmd b/vignettes/Articles/tutorial.Rmd index 0020a06b..395599e8 100644 --- a/vignettes/Articles/tutorial.Rmd +++ b/vignettes/Articles/tutorial.Rmd @@ -27,11 +27,11 @@ knitr::include_graphics( ) ``` -The following tutorial shows the basics of setting up and interacting with an ArchR Project using a gold-standard dataset of hematopoietic cells ( CITATION ). This tutorial and all of the accompanying vignettes assume that you are running ArchR __locally__. Once all of these steps work for you, feel free to [set up ArchR to work in a cluster environment](articles/Articles/clusterComputing.html). This tutorial does not explain every detail of every step. Please see the [Vignettes section](articles/index.html) for more details on each major analytical step and all of the major features of ArchR. +The following tutorial shows the basics of setting up and interacting with an ArchR Project using a gold-standard downsampled dataset of hematopoietic cells [Granja* et al. Nature Biotechnology 2019](https://www.ncbi.nlm.nih.gov/pubmed/31792411). This tutorial and all of the accompanying vignettes assume that you are running ArchR __locally__. Once all of these steps work for you, feel free to [set up ArchR to work in a cluster environment](articles/Articles/clusterComputing.html). This tutorial does not explain every detail of every step. Please see the [Vignettes section](articles/index.html) for more details on each major analytical step and all of the major features of ArchR. # What is an `ArrowFile` / `ArchRProject`? -The base unit of an analytical project in ArchR is called an `ArrowFile`. Each `ArrowFile`, stores all of the data associated with an individual sample. Here, a sample would be the most detailed unit of analysis desired (for ex. a single replicate of a particular condition). During creation and as additional analyses are performed, ArchR updates and edits each `ArrowFile` to contain additional layers of information. +The base unit of an analytical project in ArchR is called an `ArrowFile`. Each `ArrowFile`, stores all of the data associated with an individual sample (i.e. metadata, accessible fragments and data matrices). Here, a sample would be the most detailed unit of analysis desired (for ex. a single replicate of a particular condition). During creation and as additional analyses are performed, ArchR updates and edits each `ArrowFile` to contain additional layers of information. Then, an `ArchRProject` allows you to associate these `ArrowFiles` together into a single analytical framework. ![](../../images/ArchRProject_Schematic.png){width=700px} @@ -42,14 +42,14 @@ Certain actions can be taken directly on `ArrowFiles` while other actions are ta # Getting Set Up -The first thing we do is set up our working directory, load our genome annotations, and set the number of threads we would like to use. Depending on the configuration of your local environment, you may need to modify the number of `threads` used below. +The first thing we do is set up our working directory, load our genome annotations, and set the number of threads we would like to use. Depending on the configuration of your local environment, you may need to modify the number of `threads` used below in `addArchRThreads`. ```{r eval=FALSE} #Load R Libraries library(ArchR) #Set/Create Working Directory to Folder for Analysis -wd <- "" +wd <- "/Volumes/JG_SSD_2/Data/Analysis/Tutorial/Heme_Tutorial3" dir.create(wd, showWarnings = FALSE, recursive = TRUE) setwd(wd) @@ -61,27 +61,30 @@ geneAnno <- geneAnnoHg19 genomeAnno <- genomeAnnoHg19 #Set Default Threads for ArchR Functions -#By default ArchR uses the total number of cores / 2. +#By default ArchR uses the total number of cores available / 2. If windows this will be set to 1. addArchRThreads() ``` # Creating Arrow Files -For this tutorial, we will download a collection of fragment files. Fragment files are one of the base file types of the 10x Genomics analytical platform and can be easily created from any bam file. See [the ArchR input types vignette](articles/Articles/inputFiles.html) for information on making your own fragment files. Once we have our fragment files, we provide their names as a vector to `createArrowFiles`. During creation, some basic matrices and data is added to each `ArrowFile` including a `TileMatrix` containing insertion counts across genome-wide 500-bp bins. +For this tutorial, we will download a collection of fragment files. Fragment files are one of the base file types of the 10x Genomics analytical platform (and others) and can be easily created from any bam file. See [the ArchR input types vignette](articles/Articles/inputFiles.html) for information on making your own fragment files. Once we have our fragment files, we provide their names as a character vector to `createArrowFiles`. During creation, some basic matrices and data is added to each `ArrowFile` including a `TileMatrix` containing insertion counts across genome-wide 500-bp bins (see `addTileMatrix`) and a `GeneScoreMatrix` that is determined based on weighting insertion counts in tiles nearby a gene promoter (see `addGeneScoreMatrix`). ```{r eval=FALSE} -#Get Tutorial Data ~2.2GB To Download (if downloaded already ArchR will bypass downloading) +#Get Tutorial Data ~2.2GB To Download (if downloaded already ArchR will bypass downloading). inputFiles <- getTutorialData("Hematopoiesis") -#Create Arrow Files (~10-15 minutes) -#It is important +#Create Arrow Files (~10-15 minutes) w/ helpful messages displaying progress. +#This step will for each sample : +# 1. Read Accessible Fragments. +# 2. Identify Cells QC Information (TSS Enrichment, Nucleosome info). +# 3. Filter Cells based on QC parameters. +# 4. Create a TileMatrix 500-bp genome-wide. +# 5. Create a GeneScoreMatrix. ArrowFiles <- createArrowFiles( inputFiles = inputFiles, sampleNames = names(inputFiles), geneAnno = geneAnno, - genomeAnno = genomeAnno, - threads = threads, - force = FALSE + genomeAnno = genomeAnno ) ``` @@ -91,9 +94,10 @@ One major source of trouble in single-cell data is the contribution of "doublets ```{r eval=FALSE} #Add Infered Doublet Scores to each Arrow File (~5-10 minutes) -doubScores <- addDoubletScores(ArrowFiles, threads = threads) +doubScores <- addDoubletScores(ArrowFiles) #Create ArchRProject +#The outputDirectory here describes where all downstream analyses and plots go. proj <- ArchRProject( ArrowFiles = ArrowFiles, geneAnnotation = geneAnno, @@ -102,6 +106,9 @@ proj <- ArchRProject( ) #Filter Doublets +#The automatic filtering rate will be based on how many cells are in the sample, if there +#are 5,000 cells ArchR will remove up to 250 (~5%) of the cells. If you believe more cells +#should be excluded change the filterRatio argument apropriately. proj <- filterDoublets(proj) ``` @@ -114,8 +121,7 @@ At this point, we have an ArchR project that is ready to be used in downstream v proj <- addIterativeLSI( ArchRProj = proj, useMatrix = "TileMatrix", - reducedDimsOut = "IterativeLSI", - threads = threads + reducedDimsOut = "IterativeLSI" ) #Identify Clusters from Iterative LSI @@ -152,6 +158,22 @@ This [plot](../../images/tutorial_1_UMAP-Clusters.pdf) shows gene experimental s ![Alt](../../images/tutorial_1_UMAP-Clusters.pdf){width=450 height=450}
The following tutorial shows the basics of setting up and interacting with an ArchR Project using a gold-standard downsampled dataset of hematopoietic cells Granja* et al. Nature Biotechnology 2019. This tutorial and all of the accompanying vignettes assume that you are running ArchR locally. Once all of these steps work for you, feel free to set up ArchR to work in a cluster environment. This tutorial does not explain every detail of every step. Please see the Vignettes section for more details on each major analytical step and all of the major features of ArchR.
+ArrowFile
/ ArchRProject
?The base unit of an analytical project in ArchR is called an ArrowFile
. Each ArrowFile
, stores all of the data associated with an individual sample (i.e. metadata, accessible fragments and data matrices). Here, a sample would be the most detailed unit of analysis desired (for ex. a single replicate of a particular condition). During creation and as additional analyses are performed, ArchR updates and edits each ArrowFile
to contain additional layers of information. Then, an ArchRProject
allows you to associate these ArrowFiles
together into a single analytical framework.
Certain actions can be taken directly on ArrowFiles
while other actions are taken on an ArchRProject
which in turn updates each associated ArrowFile
. Because ArrowFiles
are stored as large HDF5-format files, “get-er” functions in ArchR retrieve data by interacting with the ArchRProject
.
The first thing we do is set up our working directory, load our genome annotations, and set the number of threads we would like to use. Depending on the configuration of your local environment, you may need to modify the number of threads
used below in addArchRThreads
.
#Load R Libraries
+library(ArchR)
+
+#Set/Create Working Directory to Folder for Analysis
+wd <- "/Volumes/JG_SSD_2/Data/Analysis/Tutorial/Heme_Tutorial3"
+dir.create(wd, showWarnings = FALSE, recursive = TRUE)
+setwd(wd)
+
+#Load Genome Annotations. Available annotations are for Hg19, Hg38, Mm9, or Mm10.
+#(Note if you want to build a custom annotation see createGeneAnnotation or createGenomeAnnotation).
+data("geneAnnoHg19")
+data("genomeAnnoHg19")
+geneAnno <- geneAnnoHg19
+genomeAnno <- genomeAnnoHg19
+
+#Set Default Threads for ArchR Functions
+#By default ArchR uses the total number of cores available / 2. If windows this will be set to 1.
+addArchRThreads()
For this tutorial, we will download a collection of fragment files. Fragment files are one of the base file types of the 10x Genomics analytical platform (and others) and can be easily created from any bam file. See the ArchR input types vignette for information on making your own fragment files. Once we have our fragment files, we provide their names as a character vector to createArrowFiles
. During creation, some basic matrices and data is added to each ArrowFile
including a TileMatrix
containing insertion counts across genome-wide 500-bp bins (see addTileMatrix
) and a GeneScoreMatrix
that is determined based on weighting insertion counts in tiles nearby a gene promoter (see addGeneScoreMatrix
).
#Get Tutorial Data ~2.2GB To Download (if downloaded already ArchR will bypass downloading).
+inputFiles <- getTutorialData("Hematopoiesis")
+
+#Create Arrow Files (~10-15 minutes) w/ helpful messages displaying progress.
+#This step will for each sample :
+# 1. Read Accessible Fragments.
+# 2. Identify Cells QC Information (TSS Enrichment, Nucleosome info).
+# 3. Filter Cells based on QC parameters.
+# 4. Create a TileMatrix 500-bp genome-wide.
+# 5. Create a GeneScoreMatrix.
+ArrowFiles <- createArrowFiles(
+ inputFiles = inputFiles,
+ sampleNames = names(inputFiles),
+ geneAnno = geneAnno,
+ genomeAnno = genomeAnno
+)
ArchRProject
One major source of trouble in single-cell data is the contribution of “doublets” to the analysis. A doublet refers to a single droplet that received a single barcoded bead and more than one nucleus. This causes the reads from more than one cell to appear as a single cell. We remove these computationally and describe this doublet removal process in more depth in the doublet removal vignette.
+#Add Infered Doublet Scores to each Arrow File (~5-10 minutes)
+doubScores <- addDoubletScores(ArrowFiles)
+
+#Create ArchRProject
+#The outputDirectory here describes where all downstream analyses and plots go.
+proj <- ArchRProject(
+ ArrowFiles = ArrowFiles,
+ geneAnnotation = geneAnno,
+ genomeAnnotation = genomeAnno,
+ outputDirectory = "Heme_Tutorial"
+)
+
+#Filter Doublets
+#The automatic filtering rate will be based on how many cells are in the sample, if there
+#are 5,000 cells ArchR will remove up to 250 (~5%) of the cells. If you believe more cells
+#should be excluded change the filterRatio argument apropriately.
+proj <- filterDoublets(proj)
At this point, we have an ArchR project that is ready to be used in downstream visualizations and analyses. The first thing we will do is use an iterative latent semantic indexing (LSI) approach to define clusters in our data. Once we have identified clusters in our data, we can plot a UMAP embedding. For more details, see the dimensionality reduction vignette.
+#Reduce Dimensions with Iterative LSI (~5-10 minutes)
+proj <- addIterativeLSI(
+ ArchRProj = proj,
+ useMatrix = "TileMatrix",
+ reducedDimsOut = "IterativeLSI"
+)
+
+#Identify Clusters from Iterative LSI
+#The larger the resolution the more clusters will be called. The lower the resolution hte less clusters will be called.
+#It is recommended to compare the results from your clusters and your embeddings and find params that best agree across
+#both analyses for clarity.
+proj <- addClusters(input = proj, reducedDims = "IterativeLSI", resolution = 0.6)
+
+#Add Imputation Weights for imputing numerical values based on Magic (see van Dijk et. al. 2018).
+proj <- addImputeWeights(ArchRProj = proj)
+
+#Compute a UMAP embedding to visualize our tiled accessibility matrix in a 2-d setting.
+proj <- addEmbedding(
+ ArchRProj = proj,
+ reducedDims = "IterativeLSI",
+ embedding = "UMAP",
+ embeddingParams = list(min_dist = 0.4),
+ force = TRUE
+)
+
+#Plot the UMAP Embedding with Metadata Overlayed such as Experimental Sample and Clusters.
+#To change plotting aesthetics see ?plotEmbedding parameters.
+plotList <- list()
+plotList[[1]] <- plotEmbedding(ArchRProj = proj, colorBy = "colData", name = "Sample")
+plotList[[2]] <- plotEmbedding(ArchRProj = proj, colorBy = "colData", name = "Clusters", plotParams = list(labelMeans=TRUE))
+plotPDF(plotList = plotList, name = "UMAP-Samples-Clusters", width = 6, height = 6, ArchRProj = proj)
To add your own information for plotting ontop of UMAP embedding we will show an example here.
+proj <- addCellColData(ArchRProj = proj, data = , cellNames = )
+
+#Plot the UMAP Embedding with Metadata Overlayed such as Experimental Sample and Clusters.
+#To change plotting aesthetics see ?plotEmbedding parameters.
+plotList <- list()
+plotList[[1]] <- plotEmbedding(ArchRProj = proj, colorBy = "colData", name = "Sample")
+plotList[[2]] <- plotEmbedding(ArchRProj = proj, colorBy = "colData", name = "Clusters", plotParams = list(labelMeans=TRUE))
+plotPDF(plotList = plotList, name = "UMAP-Samples-Clusters", width = 6, height = 6, ArchRProj = proj)
In order to understand which clusters correspond to which cell types, we use a supervised approach based on prior knowledge of the genes that are active in specific cell types. We determine gene activity scores for each putative marker gene based on chromatin accessibility signal in the region surrounding the gene’s promoter. We can then overlay these gene activity scores on our UMAP embedding to visualize the relationship between gene activity and cluster. For more details, see the marker genes vignette.
+markerGenes <- c(
+ "CD34", #Early Progenitor
+ "GATA1", #Erythroid
+ "PAX5", "MS4A1", #B-Cell Trajectory
+ "CD14", #Monocytes
+ "CD3D", "CD8A", "TBX21", "IL7R" #TCells
+ )
+
+#Plot the UMAP Embedding with Marker Genes Overlayed
+plotList <- list()
+plotList[[1]] <- plotEmbedding(ArchRProj = proj, colorBy = "GeneScoreMatrix", name = markerGenes)
+plotPDF(plotList = plotList, name = "UMAP-Marker-Gene-Scores", width = 6, height = 6, ArchRProj = proj)
+
+#Plot Tracks at Marker Genes
+plotTracks <- ArchRRegionTrack(ArchRProj = proj, threads = threads, geneSymbol = markerGenes)
+plotPDF(plotList = plotTracks, name = "Tracks-Marker-Genes", width = 6, height = 8, ArchRProj = proj)
+
+#Identify Marker Gene through Pairwise Test vs Null
+markersGS <- markerFeatures(ArchRProj = proj, threads = threads, useMatrix = "GeneScoreMatrix")
+heatmapGS <- markerHeatmap(
+ seMarker = markersGS,
+ cutOff = "FDR <= 0.01 & Log2FC >= 1",
+ labelMarkers = markerGenes
+)
+plotPDF(heatmapGS, name = "GS-Marker-Heatmap", width = 8, height = 12, ArchRProj = proj)
One of the most complicated aspects about ATAC-seq and scATAC-seq analysis is the generation of a reproducible and robust peak set. In ArchR, we use an iterative overlap removal process that we first described in Corces* & Granja* et al. Science 2018. This process is described in detail in the peak calling vignette.
+To robustly call peaks, we first merge the sparse single-cell data into pseudo-bulk replicates by aggregating the insertions from many individual cells into a single group. We make multiple pseudo-bulk replicates for each cluster to enable an assessment of peak reproducibility. This process of pseudo-bulk generation is described in detail in the pseudo-bulk generation vignette. We than call peaks using MACS2 and perform our iterative overlap removal. Once we obtain a finalized peak set, we collect insertion counts in each peak for each single cell and associate this with the corresponding ArrowFile
via the ArchRProject
.
#Create Group Coverage Files that can be used for downstream analysis (~5-10 minutes)
+proj <- addGroupCoverages(ArchRProj = proj, groupBy = "Clusters", threads = threads)
+
+#Call Reproducible Peaks w/ Macs2 (~5-10 minutes)
+proj <- addReproduciblePeakSet(ArchRProj = proj, groupBy = "Clusters", threads = threads)
+
+#Add Peak Matrix
+proj <- addPeakMatrix(ArchRProj = proj, threads = threads, force = TRUE)
Often times, we are interested to know which peaks are unique to an individual cluster or a small group of clusters. We can do this in an unsupervised fashion in ArchR:
+#Identify Marker Peaks
+markersPeaks <- markerFeatures(ArchRProj = proj, threads = threads, useMatrix = "PeakMatrix")
+heatmapPeaks <- markerHeatmap(
+ seMarker = markersPeaks,
+ cutOff = "FDR <= 0.01 & Log2FC >= 1"
+)
+plotPDF(heatmapPeaks, name = "Peak-Marker-Heatmap", width = 8, height = 12, ArchRProj = proj)
Using our tutorial data, your marker peak heatmap should look like this.
+QQQQQQQ - I think the concept of TF deviations is still abstract to most people. Does ArchR do motif enrichment with hypergeometric test as well??
+Using the reproducible peak set that we defined above, we can use ArchR to calculate TF deviations on a single-cell basis for transcription factors in the peaks identified in each cluster. We can then overlay these deviations on on UMAP embedding. This effectively infers differences in TF activity across all single cells and is very useful in identifying regulatory factors governing cell fate.
+#Motif Search in PeakSet
+proj <- addMotifAnnotations(ArchRProj = proj, name = "Motif")
+
+#Add chromVAR Deviations (~20-25 min if using CisBP MotifSet)
+proj <- addDeviationsMatrix(ArchRProj = proj, threads = threads)
+
+#Identify Variable TFs
+head(getVarDeviations(proj, plot = FALSE)$name, 25)
+
+#To access motif need to specify deviations,z : motif_name
+#Try getFeatures with MotifMatrix to see available names
+getFeatures(proj, select = "PAX5", useMatrix = "MotifMatrix")
+
+#Define the list of motifs to plot
+motifs <- c("GATA1_383", "CEBPA_155", "EBF1_67","IRF4_632","TBX21_780","PAX5_709")
+
+#Plot the UMAP Embedding with chromVAR Deviations Overlayed
+plotList <- list()
+plotList[[1]] <- plotEmbedding(ArchRProj = proj, colorBy = "colData", name = "Clusters")
+plotList[[2]] <- plotEmbedding(ArchRProj = proj, colorBy = "MotifMatrix", name = paste0("z:",motifs))
+plotPDF(plotList = plotList, name = "Plot-UMAP-TileLSI-MotifMatrix", width = 6, height = 6, ArchRProj = proj)
Using our tutorial data, your marker peak heatmap should look like this.
+Transcription factor footprinting can also be done in ArchR with a single command. We note that the footprints generated by the tutorial data are not as clean as would be desired but this is because of the small size of the tutorial dataset.
+ +