Skip to content

suongwong/benj

 
 

Repository files navigation

Benj

Benj utility scripts

Installation

R installation: Use devtools to install

devtools::install_github("kellislab/benj")

Python installation: Use pip

pip install git+https://github.com/kellislab/benj

Conda environments

Environments under conda/ can be used. Mamba is recommended for environment creation because of the number of packages

Luria installation

On Luria, you can put this script in ~/src/conda_init.sh, where you can run source ~/src/conda_init.sh to activate Conda, and then you can run conda activate benj to activate the benj environment.

#!/usr/bin/env bash
# >>> conda initialize >>>
# !! Contents within this block are managed by 'conda init' !!
__conda_setup="$('/net/bmc-lab5/data/kellis/group/Benjamin/software/miniconda3/bin/conda' 'shell.bash' 'hook' 2> /dev/null)"
if [ $? -eq 0 ]; then
    eval "$__conda_setup"
else
    if [ -f "/net/bmc-lab5/data/kellis/group/Benjamin/software/miniconda3/etc/profile.d/conda.sh" ]; then
	  . "/net/bmc-lab5/data/kellis/group/Benjamin/software/miniconda3/etc/profile.d/conda.sh"
    else
	  export PATH="/net/bmc-lab5/data/kellis/group/Benjamin/software/miniconda3/bin:$PATH"
    fi
fi
unset __conda_setup
# <<< conda initialize <<<

RNA integration/subtyping framework

The overall strategy is

  • Load per-sample H5AD from cellranger H5, including velocyto loom. Barcodes are renamed to “Sample#barcode-1” to make multiomic ATAC+GEX analysis easier, as ArchR uses this format.
  • Concatenate all H5AD, add metadata, and run scrublet to remove doublets.
  • Integrate cell type iteratively

H5AD from CellRanger RNA

From a list of sample names and filtered_feature_bc_matrix.h5 files in a file called “items.txt”, you can enter a SLURM script as follows to build H5AD for each:

#!/usr/bin/env bash
#SBATCH -p kellis
#SBATCH --array=1-106%10
#SBATCH --nice=10000
cd /path/to/items/dir
export SAMPLE=$(< "items.txt" awk -v TASK=${SLURM_ARRAY_TASK_ID} 'NR == TASK { print $1 }')
export H5=$(< "items.txt" awk -v TASK=${SLURM_ARRAY_TASK_ID} 'NR == TASK { print $2 }')
export OUTPUT="${SAMPLE}.h5ad"
if [ -f "${OUTPUT}" ]; then
    echo "File ${OUTPUT} already created."
else
    module load lmod-conda
    conda activate pybenj
    `type -P time` h5ad_from_cellranger_rna.py --h5 "${H5}" --sample "${SAMPLE}" --output "${OUTPUT}"
fi

Aggregation from list of H5AD and sample metadata

Now is when sample metadata is included. For every sample in metadata index, the H5AD is read in from the directory.

aggregate_anndata.py --metadata /path/to/md.tsv -d /path/to/h5ad/dir/ -o concatenated.h5ad

Integration process

   function integrate {
	  cls=${1:-overall}
	  shift
	  integrate_rna.py -i concatenated.h5ad \
			   -o "${cls}/${cls}.h5ad" \
			   -t "${cls}/${cls}_clust.tsv.gz" \
			   -f "${cls}" \
			   -b "batch" \
			   -l "${cls}_clust" \
			   --hvg 5000 \
			   --dotplot Mbp Pdgfra Aif1 Gfap Flt1 Pdgfrb Syt1 Slc17a7 Gad1 \
			   "$@"
   }
   integrate overall
   integrate exc -a ./overall_clust/overall_clust.tsv.gz --subset overall_clust=C1,C2,C3 --plot overall_clust
   integrate exc_L23 -a ./exc_clust/exc_clust.tsv.gz --subset exc_clust=C0,C2,C3 --plot overall_clust exc_clust

ATAC integration

The ATAC integration pipeline is very similar to the RNA integration pipeline.

As a brief overview, this pipeline uses both ArchR and muon. ArchR is currently better at QC, peaks, and motifs, and muon is better at primary analysis (cell types, UMAPs, multiomics).

So, the same barcodes (ArchR format, “Sample#barcode-1”) is used to ensure no mixup.

The steps consist of:

Create initial arrow files in R, and add ArchR metadata:

library(ArchR)
addArchRGenome("hg38")
geneAnnotation = benj::createGeneAnnotationGFF("/path/to/cellranger/refdata/genes/genes.gtf", OrgDb=org.Hs.eg.db::org.Hs.eg.db, dataSource="cellranger", organism="Homo sapiens")
ArrowFiles=createArrowFiles(..., geneAnnotation=geneAnnotation)
proj = createArchRProject(ArrowFiles, outputDirectory="ArchR", geneAnnotation=geneAnnotation)
gzf = gzfile("ArchR_metadata.tsv.gz", "w")
write.table(as.data.frame(proj@cellColData), gzf, sep="\t")
close(gzf)
saveArchRProject(proj, "ArchR")

Use muon to count fragments using a peak set

If you don’t have a previous peak annotation, use a tile BED file. Usually the cellranger-arc aggr pipeline provides a good peak BED that is calculated early in the process, so cellranger can be quit after the BED

#!/usr/bin/env bash
#SBATCH -p kellis
#SBATCH --array=1-106%10
#SBATCH --nice=10000
cd /home/benjames/data/SCORCH/1.Mash_BA9_NAc/ATAC
export SAMPLE=$(< "items.txt" awk -v TASK=${SLURM_ARRAY_TASK_ID} 'NR == TASK { print $1 }')
export FRAG=$(< "items.txt" awk -v TASK=${SLURM_ARRAY_TASK_ID} 'NR == TASK { print $2 }')
export META="/path/to/ArchR_metadata.tsv.gz"
export PEAKS="/path/to/atac_peak_annotation.tsv.gz"
export OUTPUT="${SAMPLE}.h5ad"
if [ -f "${OUTPUT}" ]; then
    echo "File ${OUTPUT} already exists"
else
    module load lmod-conda
    conda activate pybenj
    `type -P time` h5ad_from_archr_annotation.py --fragments "${FRAG}" --sample "${SAMPLE}" --cell-metadata "${META}" --peaks "${PEAKS}" --output "${OUTPUT}"
fi

Aggregation from a list of H5AD and sample metadata

Now is when sample metadata is included. For every sample in metadata index, the H5AD is read in from the directory.

aggregate_anndata.py --metadata /path/to/md.tsv -d /path/to/h5ad/dir/ -o concatenated.h5ad

Gene estimation

Using ArchR style gene estimation, except use the peak set instead of tile matrix.

estimate_gene_accessibility -i concatenated.h5ad -o gacc.h5ad --gtf /path/to/cellranger/genes/genes.gtf.gz

Or, if you have annotations already (or after annotations!) you can rank genes and plot:

estimate_gene_accessibility -i concatenated.h5ad -o gacc.h5ad --gtf /path/to/cellranger/genes/genes.gtf.gz --groupby CellType

Integration process

Very similar to RNA integration process. But, use sample level batch correction.

   function integrate {
	  cls=${1:-overall}
	  shift
	  integrate_atac.py -i concatenated.h5ad \
			   -o "${cls}/${cls}.h5ad" \
			   -t "${cls}/${cls}_clust.tsv.gz" \
			   -f "${cls}" \
			   -b "Sample" \
			   -l "${cls}_clust" \
			   "$@"
   }
   integrate overall
   integrate exc -a ./overall_clust/overall_clust.tsv.gz --subset overall_clust=C1,C2,C3 --plot overall_clust
   integrate exc_L23 -a ./exc_clust/exc_clust.tsv.gz --subset exc_clust=C0,C2,C3 --plot overall_clust exc_clust

Subtype peaks/overall re-done peaks

From the integrated ATAC, load in the *_clust.tsv.gz files, and addGroupCoverages in ArchR, then call new peaks. Then, you can iteratively improve the integration.

Multiome workflow

  • Currently, you should process 1) RNA first, using RNA subtyping framework.
  • Then, process ATAC alone as single-omic. But, in the integrate() function, add an annotation for the

Data environment

Standardized single cell data environments allow for easy computation.

This repo works best with the following environment structure per-batch (e.g. from cellranger count):

├── aggr.csv
├── sample_list.txt
├── metrics_summary.tsv
├── ArrowFiles
│   ├── archr_metadata.tsv.gz
│   ├── Sample1.arrow
│   ├── Sample2.arrow
│   ├── Sample3.arrow
│   ├── Sample4.arrow
│   └── Sample5.arrow
├── H5AD
│   ├── cellbender
│   │   ├── Sample1.h5ad
│   │   ├── Sample2.h5ad
│   │   ├── Sample3.h5ad
│   │   ├── Sample4.h5ad
│   │   ├── Sample5.h5ad
│   ├── filtered
│   │   ├── Sample1.h5ad
│   │   ├── Sample2.h5ad
│   │   ├── Sample3.h5ad
│   │   ├── Sample4.h5ad
│   │   ├── Sample5.h5ad
│   └── raw
│       ├── Sample1.h5ad
│       ├── Sample2.h5ad
│       ├── Sample3.h5ad
│       ├── Sample4.h5ad
│       └── Sample5.h5ad
├── Sample1
│   ├── _cmdline
│   ├── _filelist
│   ├── _finalstate
│   ├── _invocation
│   ├── _jobmode
│   ├── _log
│   ├── _mrosource
│   ├── outs
...
├── Sample2
│   ├── _cmdline
│   ├── _filelist
│   ├── _finalstate
│   ├── _invocation
│   ├── _jobmode
│   ├── _log
│   ├── _mrosource
│   ├── outs
...

Checking everything

The script check_batch.sh allows a user to check if a certain batch matches what is expected from this structure to ensure uniform quality.

aggr.csv

aggr.csv is generated by benj::make_aggregation_table() or something similar, where sample name is the 1st column, the atac_fragments are contained in another column if ATAC is included. Other columns should be metadata to include in H5AD per sample, such as tissue or post-mortem interval, or case/control status.

sample_list.txt

List of samples. Ideally used for cellranger-count, but used to keep track of the directories output from count. May not necessarily be the rownames in aggr.csv if you want to rename after counting.

metrics_summary.tsv

Summary of metrics per sample, such as # of reads, # of estimated cells. Computed by:

cellranger_metrics_summary -i ./*/outs/*summary.csv -o metrics_summary.tsv

and can be used, e.g. extracting expected number of cells for cellbender.

H5AD/

H5AD files should be computed using h5ad_from_cellranger_rna.sh, for each of filtered_feature_bc_matrix, raw_feature_bc_matrix, and cellbender_filtered. Note that barcodes will use ArchR format, in the case of multiome data, to allow for shared barcodes to trivially overlap.

You can rename the sample if necessary. Note that velocyto counts will be included by default if exists, and cellranger counts will be aggregated into the cellranger directory if those exist.

For example, if you have the sample names used in cellranger count in sample_list.txt and you have the new names as the rownames in aggr.csv, you can use;

paste <(cat sample_list.txt) <(tail -n+2 <aggr.csv | cut -d , -f 1) | awk -F, '{ print "h5ad_from_cellranger_rna.sh \"$1\" \"$2\"" }' | xargs -P8 sh -c

Or, if they are the same:

< sample_list.txt xargs -P8 h5ad_from_cellranger_rna.sh

CellBender

Speaking of cellbender, if computed, you should place the H5 files into the outs/ subdirectory per counts directory, just like filtered_feature_bc_matrix.h5.

velocyto

Using velocyto is easy now. If you’re in a conda env in the batch directory, you can use:

velocyto_slurmgen.sh -b /path/to/batch/directory | sbatch -p my_queue_name

to generate velocyto counts.

ATAC (ArchR/benj)

For ATAC, ArchR metadata is later used for metadata, and arrow files are needed regardless for secondary analysis. To compute Arrow files for each fragment file, and compute metadata,

mkdir ArrowFiles && pushd ArrowFiles
archr_from_fragments.R -a ../aggr.csv --tss 1 -g /path/to/ArchR_refdata-cellranger-arc-GRCh38-2020-A-2.0.0.rds
popd

Computing environment

Conda

I currently have the file ~/src/conda_init.sh on Luria as:

#!/usr/bin/env bash
# >>> conda initialize >>>
# !! Contents within this block are managed by 'conda init' !!
__conda_setup="$('/net/bmc-lab5/data/kellis/group/Benjamin/software/miniconda3/bin/conda' 'shell.bash' 'hook' 2> /dev/null)"
if [ $? -eq 0 ]; then
    eval "$__conda_setup"
else
    if [ -f "/net/bmc-lab5/data/kellis/group/Benjamin/software/miniconda3/etc/profile.d/conda.sh" ]; then
        . "/net/bmc-lab5/data/kellis/group/Benjamin/software/miniconda3/etc/profile.d/conda.sh"
    else
        export PATH="/net/bmc-lab5/data/kellis/group/Benjamin/software/miniconda3/bin:$PATH"
    fi
fi
unset __conda_setup
# <<< conda initialize <<<

which should allow anyone with group access to use the benj/pybenj/rbenj environments as updated.

Config files

In ~/.bashrc of BOTH the host and all clients/desktops you want to connect from, set

export JUPYTER_LURIA_PREFIX=110

or some number above 80. This will help if anyone else is using the port number on Luria.

SSH config

The simple SSH config here ~/.ssh/config should keep things simple. Please use an SSH key to make things easy.

Host luria.mit.edu
    HostName 10.159.3.125
    User your_kerb
    PubkeyAcceptedKeyTypes +ssh-rsa

Match exec "echo %h | grep -qE '^b[0-9]+$'"
    HostName %h
    ProxyJump luria.mit.edu
    User your_kerb

SystemD config

If you are on GNU/Linux that uses SystemD, you could use the following per-user service file, ~/.config/systemd/user/[email protected]:

[Unit]
Description=jupyter_luria.sh "%I"
After=network.target

[Service]
ExecStart=%h/.local/bin/jupyter_luria.sh %i
Restart=on-failure
RestartSec=5s
Environment="JUPYTER_LURIA_PREFIX=95"

[Install]
WantedBy=default.target

assuming jupyter_luria.sh is in ~/.local/bin/ (you can find this out with type -P jupyter_luria.sh), and the JUPYTER_LURIA_PREFIX is the same as on the Luria server.

Then, systemctl --user daemon-reload and systemctl --user start jupyter_luria@b3 should connect you to b3.

Launching a job

Then, to launch a Jupyter job on Luria,

sbatch jupyter_luria.sh benj

to launch a jupyter lab in the environment named “benj”.

Look at the hostname of the just-launched job via

squeue -u ${USER}

for example, b3, and connect on your laptop/desktop with:

jupyter_luria.sh b3

Then, you should go in your web browser and connect to https://localhost:XXXX/ where XXXX is the port output by the script.

If you have trouble connecting, try running jupyter lab password in a compute node to use password-based login instead of token-based.

LMod modules

To use, in your ~/.bash_profile, put

module use /path/to/this/repo/modules

and re-login to view changes.

Conda integration is at modules/lmod-conda

To change the default Conda root directory, replace ~/data/miniconda3 with your conda root directory.

Genome files

ENCODE Exclusion list regions

GenomeURL
hg19https://www.encodeproject.org/files/ENCFF001TDO/@@download/ENCFF001TDO.bed.gz
hg38https://www.encodeproject.org/files/ENCFF356LFX/@@download/ENCFF356LFX.bed.gz
mm10https://www.encodeproject.org/files/ENCFF547MET/@@download/ENCFF547MET.bed.gz

GENCODE GTF

GenomeURL
hg38 GTFhttps://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_43/gencode.v43.annotation.gtf.gz
hg19 GTFhttps://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_43/GRCh37_mapping/gencode.v43lift37.annotation.gtf.gz
GRCm39 GTFhttps://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M32/gencode.vM32.annotation.gtf.gz

10X references

RepeatMasker

Download from https://genome.ucsc.edu/cgi-bin/hgTables

Datasets

LinkMD5sum
https://cf.10xgenomics.com/samples/cell-arc/2.0.0/pbmc_granulocyte_sorted_10k/pbmc_granulocyte_sorted_10k_filtered_feature_bc_matrix.h5df86844b99161b9487090d91e644745e
https://cf.10xgenomics.com/samples/cell-arc/2.0.0/pbmc_granulocyte_sorted_10k/pbmc_granulocyte_sorted_10k_atac_fragments.tsv.gz7635e27373de5dabd5b54ad58a30bc61
https://cf.10xgenomics.com/samples/cell-arc/2.0.0/pbmc_granulocyte_sorted_10k/pbmc_granulocyte_sorted_10k_atac_fragments.tsv.gz.tbi134a3ca2dc01c398a2905504bd6384f7
https://cf.10xgenomics.com/samples/cell-arc/2.0.0/pbmc_granulocyte_sorted_10k/pbmc_granulocyte_sorted_10k_atac_peak_annotation.tsv38f8abd2ba764e9693869e0111ad7a59
https://cf.10xgenomics.com/samples/cell-arc/2.0.0/human_brain_3k/human_brain_3k_filtered_feature_bc_matrix.h5ba0b765eddb138d6d6294227879b9a9b
https://cf.10xgenomics.com/samples/cell-arc/2.0.0/human_brain_3k/human_brain_3k_atac_fragments.tsv.gzb1594a4096405128e646e6a275e3ada3
https://cf.10xgenomics.com/samples/cell-arc/2.0.0/human_brain_3k/human_brain_3k_atac_fragments.tsv.gz.tbi3054c179689ff025f9e64df6d7a79040
https://cf.10xgenomics.com/samples/cell-arc/2.0.0/human_brain_3k/human_brain_3k_atac_peak_annotation.tsv5c9cde0442444bbc2c4c57c577db6c80

Releases

No releases published

Packages

No packages published

Languages

  • Python 44.0%
  • R 39.5%
  • Shell 10.1%
  • Makefile 5.5%
  • Other 0.9%