This repository contains the files to build and archive genome assets to serve with refgenieserver at http://rg.databio.org.
The whole process is scripted, starting from this repository. From here, we do this basic workflow:
- Download raw input files for assets (FASTA files, GTF files etc.)
- Configure refgenie
- Build assets with refgenie build in a local refgenie instance
- Archive assets with refgenieserver archive
- Upload archives to S3
- Deploy assets to active server on AWS.
The metadata is located in the asset_pep folder, which contains a PEP with metadata for each asset. The contents are:
- assets.csv - The primary sample_table. Each row is an asset.
- recipe_inputs.csv - The subsample_table. This provides a way to define each individual value passed to any of the 3 arguments of the refgenie build command: --assets, --params, and --files.
- refgenie_build_cfg.yaml - The config file that defines a subproject (which is used to download the input data) and additional project settings.
To add an asset, you will need to add a row in assets.csv. Follow these directions:
- genome - the human-readable genome (namespace) you want to serve this asset under
- asset - the human-readable asset name you want to serve this asset under. It is identical to the asset recipe. Use refgenie list to see available recipes.
Your asset will be retrievable from the server with refgenie pull {genome}/{asset_name}.
Next, we need to add the source for each item required by your recipe. You can see what the recipe requires by using -q or --requirements, like this: refgenie build {genome}/{recipe} -q. If your recipe doesn't require any inputs, then you're done. If it requires any inputs (which can be one or more of the following: assets, files, parameters), then you need to specify these in the recipe_inputs.csv table.
For each required input, you add a row to recipe_inputs.csv. Follow these directions:
- sample_name - must match the genome and asset values in the assets.csv file. Format it this way: <genome>-<asset>. This is how we match inputs to assets.
Next you will need to fill in 3 columns:
- input_type - one of the following: files, params or assets
- input_id - must match the recipe requirement. Again, use refgenie build <genome>/<asset> -q to learn the ids
- input_value - value for the input, e.g. URL in case of files
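The sample_name convention can be checked mechanically before building. The sketch below is a minimal illustration, assuming both tables are plain comma-separated files with a header row and columns named genome, asset, and sample_name as described above. The helper names (col_index, unmatched_inputs), the demo rows, and the example.org URLs are made up for the demonstration and are not taken from the real PEP.

```shell
# Sketch: list sample_name values in recipe_inputs.csv that have no
# matching <genome>-<asset> row in assets.csv.
# Assumption: comma-separated files with a header row and no quoted commas.

# Print the 1-based index of the column named $1 in the header of file $2.
col_index() {
  head -n 1 "$2" | tr ',' '\n' | grep -n -x "$1" | cut -d: -f1
}

# Print every sample_name in the inputs table ($2) that does not correspond
# to a row of the assets table ($1). Prints nothing when all inputs match.
unmatched_inputs() (
  g=$(col_index genome "$1")
  a=$(col_index asset "$1")
  s=$(col_index sample_name "$2")
  awk -F',' -v g="$g" -v a="$a" -v s="$s" '
    FNR == 1 { next }                          # skip header rows
    NR == FNR { known[$g "-" $a] = 1; next }   # first file: assets table
    !($s in known) { print $s }                # second file: inputs table
  ' "$1" "$2"
)

# Demo with hypothetical rows.
demo_dir=$(mktemp -d)
printf 'genome,asset\nhg38,fasta\nhg38,gencode_gtf\n' > "$demo_dir/assets.csv"
printf 'sample_name,input_type,input_id,input_value\n' > "$demo_dir/recipe_inputs.csv"
printf 'hg38-fasta,files,fasta,https://example.org/hg38.fa.gz\n' >> "$demo_dir/recipe_inputs.csv"
printf 'mm10-fasta,files,fasta,https://example.org/mm10.fa.gz\n' >> "$demo_dir/recipe_inputs.csv"
unmatched_inputs "$demo_dir/assets.csv" "$demo_dir/recipe_inputs.csv"   # prints mm10-fasta
```

An empty result means every input row points at an asset that exists in assets.csv. This check only covers the naming convention; eido validation is still needed for the schema.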
Validate the PEP with eido
The command below validates the PEP against a remote schema. Any PEP issues will result in a ValidationError:
eido validate refgenie_build_cfg.yaml -s http://schema.databio.org/refgenie/refgenie_build.yaml
In this guide we'll use environment variables to keep track of where stuff goes.
- BASEDIR points to our parent folder where we'll do all the building/archiving
- GENOMES points to pipeline output (referenced in the project config)
- REFGENIE_RAW points to a folder where the downloaded raw files are kept
- REFGENIE points to the refgenie config file
- REFGENIE_ARCHIVE points to the location where we'll store the actual archives
export SERVERNAME=rg.databio.org
export BASEDIR=$PROJECT/deploy/$SERVERNAME
export GENOMES=$BASEDIR/genomes
export REFGENIE_RAW=/project/shefflab/www/refgenie_$SERVERNAME
export REFGENIE=$BASEDIR/$SERVERNAME/config/refgenie_config.yaml
export REFGENIE_ARCHIVE=$GENOMES/archive
mkdir -p $BASEDIR
cd $BASEDIR
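Because every later command depends on these variables, it can help to confirm that they are all set before continuing. The following is a small sketch; the helper name missing_vars is made up for illustration and is not part of refgenie or looper.

```shell
# Sketch: report which of the named variables are unset or empty.
# missing_vars is a hypothetical helper, not part of any tool used in this guide.
missing_vars() {
  for name in "$@"; do
    # Indirect lookup of the variable called $name (POSIX-compatible).
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "$name"
    fi
  done
}

# Prints one line per missing variable; prints nothing when all are set.
missing_vars SERVERNAME BASEDIR GENOMES REFGENIE_RAW REFGENIE REFGENIE_ARCHIVE
```

If the call prints any names, re-run the corresponding export lines before moving on.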
To start, clone this repository:
git clone git@github.com:refgenie/$SERVERNAME.git
Many of the assets require some input files, and we have to make sure we have those files locally. In the recipe_inputs.csv file, we have entered these files as remote URLs, so the first step is to download them. We have created a subproject called getfiles for this. To programmatically download all the files required by refgenie build, run from this directory using looper:
cd $SERVERNAME
mkdir -p $REFGENIE_RAW
looper run asset_pep/refgenie_build_cfg.yaml -p local --amend getfiles --sel-attr asset --sel-incl fasta
Check the status with looper check (or looper check --itemized):
looper check asset_pep/refgenie_build_cfg.yaml --amend getfiles --sel-attr asset --sel-incl fasta
This repository comes with a genome configuration file already defined in the config directory, but if you have not initialized refgenie yet or want to start over, you can first initialize the config like this:
refgenie init -c $REFGENIE -f $GENOMES -u http://awspds.refgenie.databio.org/rg.databio.org/ -a $REFGENIE_ARCHIVE -b refgenie_config_archive.yaml
Once files are present locally, we can run refgenie build on each asset specified in the sample_table (assets.csv). We have to submit fasta assets first:
Option A: Leveraging MapReduce programming model for concurrent builds
Since we're about to build multiple assets concurrently, we will first build the assets with the --map option to store the metadata in a separate, newly created genome configuration file. This avoids any conflicts in concurrent asset builds. Subsequently, we'll run refgenie build with the --reduce option to combine the metadata into a single genome configuration file.
Refgenie doesn't account for asset dependencies. Therefore, as we have assets that depend on other assets, we need to take care of the dependencies ourselves:
- refgenie build --map all fasta assets to establish genome namespaces
- Wait until jobs are completed, call refgenie build --reduce
- refgenie build --map all other top-level assets, e.g. fasta_txome, gencode_gtf
- Wait until jobs are completed, call refgenie build --reduce
- refgenie build --map all derived assets, e.g. bowtie2_index, bwa_index
- Wait until jobs are completed, call refgenie build --reduce
looper run asset_pep/refgenie_build_cfg.yaml -p bulker_slurm --sel-attr asset --sel-incl fasta
This will create one job for each asset. Monitor job progress with looper check:
looper check asset_pep/refgenie_build_cfg.yaml --sel-attr asset --sel-incl fasta --itemized
The Reduce procedure is quick, so there's no need to submit the job to the cluster. Just run it locally:
refgenie build --reduce
This takes care of the first two points. Repeat the above steps for the other assets.
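The staged order described above can be written down as a dry run that only prints the commands for each stage. This is a sketch, not the repository's own tooling: the function name print_build_plan is made up, the stage contents use only the example assets named above, and selecting one asset per looper call is a choice made here to avoid assuming that --sel-incl accepts several values.

```shell
# Sketch: print (do not execute) the map and reduce commands, stage by stage.
# print_build_plan is a hypothetical helper; stages mirror the examples in the text.
print_build_plan() {
  cfg=asset_pep/refgenie_build_cfg.yaml
  stage=1
  # Each argument is one stage: a space-separated list of assets
  # that do not depend on each other and can be built concurrently.
  for assets in "$@"; do
    echo "# stage $stage: map"
    for asset in $assets; do
      echo "looper run $cfg -p bulker_slurm --sel-attr asset --sel-incl $asset"
    done
    echo "# stage $stage: wait for the jobs to finish, then reduce"
    echo "refgenie build --reduce"
    stage=$((stage + 1))
  done
}

print_build_plan "fasta" "fasta_txome gencode_gtf" "bowtie2_index bwa_index"
```

Printing the plan first makes it easy to review the order before submitting anything to the cluster. The reduce step of one stage must finish before the map step of the next stage starts.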
Option B: Building all assets with Snakemake
Alternatively, you can use the Snakemake workflow in the snakemake_workflow directory. This workflow uses Snakemake's built-in rule dependency mechanism to encode the refgenie build asset dependencies.
Genome and assets
By default, all the genomes and all the assets specified in the asset PEP will be built. However, this can be restricted using a Snakemake workflow configuration file (config.yaml). To build everything, make sure that config.yaml is empty.
To specify which genomes to build, list them in config.yaml, like so:
genomes_to_process:
- hg38
- mm10
To specify which assets to exclude from building, list them in config.yaml, like so:
assets_to_exclude:
- bwa_index
- ensembl_gtf
In addition to the config file, these values can be overridden via the command line.
Compute resources
There is a pre-configured SLURM Snakemake profile included in this repository. It specifies the default SLURM settings, which are adjusted on the fly based on the asset/genome characteristics. To use it, specify the profile with the --profile slurm option.
You also need to set the maximum number of cluster jobs running in parallel with --jobs.
You can generate a DAG of assets to be built with the snakemake --dag command:
snakemake reduce_all --dag | dot -Tsvg > dag.svg
To execute the Snakemake workflow, which will submit the jobs to the cluster, run the following:
cd snakemake_workflow
snakemake reduce_all --profile slurm --jobs 8
where reduce_all is the name of the target rule to execute.
Assets are built locally now, but to serve them, we must archive them using refgenieserver. The general command is refgenieserver archive -c <path/to/genomes.yaml>. Since the archive process is generally lengthy, it makes sense to submit this job to a cluster. We can use looper to do that.
To start over completely, remove the archive config file with:
rm config/refgenie_config_archive.yaml
Then submit the archiving jobs with looper run:
looper run asset_pep/refgenieserver_archive_cfg.yaml -p bulker_local --sel-attr asset --sel-incl fasta
Check progress with looper check:
looper check asset_pep/refgenieserver_archive_cfg.yaml --sel-attr asset --sel-incl fasta
Now the archives should be built, so we'll sync them to AWS. Use the refgenie credentials (here added with --profile refgenie, which should be preconfigured with aws configure):
aws s3 sync $REFGENIE_ARCHIVE s3://awspds.refgenie.databio.org/rg.databio.org/ --profile refgenie
Now everything is ready to deploy. If using refgenieserver directly, you'll run refgenieserver serve config/refgenieserver_archive_cfg. We're hosting this repository on AWS and use GitHub Actions to trigger deploy jobs that push the updates to AWS ECS whenever a change is detected in the config file.
git add -A; git commit -m "Deploy to ECS"; git push origin HEAD