This repository contains necessary files to build and archive reference genome assets to serve with refgenieserver
. The whole process is scripted, starting from this repository. From here, we download the input data (FASTA files), use refgenie build
to create all of these assets in a local refgenie instance, and then use refgenieserver archive
to build the server archives, and finally serve them with a refgenieserver instance by calling refgenieserver serve
.
- Add a line for each genome in genome_descriptions.csv.
- Add a line for each asset in assets.csv.
- Add any required build inputs (links to files, parameters, etc) to recipe_inputs.csv
In this guide we'll use environment variables to keep track of where stuff goes.
BASEDIR
points to our parent folder where we'll do all the building/archivingGENOMES
points to pipeline output (referenced in the project config)REFGENIE_RAW
points to a folder where the downloaded raw files are keptREFGENIE
points to the refgenie config fileREFGENIE_ARCHIVE
points to the location where we'll store the actual archives
export SERVERNAME=btref.databio.org
export BASEDIR=$PROJECT/deploy/$SERVERNAME
export BASEDIR=$HOME/refgenie_test # For local testing
export GENOMES=$BASEDIR/genomes
export REFGENIE_RAW=$BASEDIR/refgenie_$SERVERNAME
export REFGENIE=$BASEDIR/$SERVERNAME/config/refgenie_config.yaml
export REFGENIE_ARCHIVE=$GENOMES/archive
mkdir -p $BASEDIR
cd $BASEDIR
To start, clone this repository:
cp -R $CODE/$SERVERNAME . # For local testing
<!-- git clone [email protected]:refgenie/$SERVERNAME.git -->
Many of the assets require some input files, and we have to make sure we have those files locally. In the recipe_inputs.csv
file, we have entered these files as remote URLs, so the first step is to download them. We have created a subproject called getfiles
for this: To programmatically download all the files required by refgenie build
, run from this directory using looper:
cd $SERVERNAME
mkdir -p $REFGENIE_RAW
looper run asset_pep/refgenie_build_cfg.yaml -p local --amend getfiles --sel-attr asset --sel-incl fasta
Check the status with looper check --use-pipesat
:
--use-pipesat
option is required as of early 2021. Might not be required if you're running later on.
looper check asset_pep/refgenie_build_cfg.yaml --amend getfiles --sel-attr asset --sel-incl fasta --use-pipestat
This repository comes with a genome configuration file already defined in \config
directory, but if you have not initialized refgenie yet or want to start over, then first you can initialize the config like this:
refgenie init -c config/refgenie_config.yaml -f $GENOMES -u http://awspds.refgenie.databio.org/refgenomes.databio.org/ -a $GENOMES/archive -b refgenie_config_archive.yaml
Once files are present locally, we can run refgenie build
on each asset specified in the sample_table (assets.csv
). We have to submit fasta assets first:
looper run asset_pep/refgenie_build_cfg.yaml -p bulker_local --sel-attr asset --sel-incl fasta
This will create one job for each asset. Monitor job progress with:
grep CANCELLED ../genomes/submission/*.log
ll ../genomes/submission/*.log
grep error ../genomes/submission/*.log
grep maximum ../genomes/submission/*.log
ll ../genomes/data/*/*/*/_refgenie_build/*.flag
ll ../genomes/data/*/*/*/_refgenie_build/*failed.flag
ll ../genomes/data/*/*/*/_refgenie_build/*completed.flag
ll ../genomes/data/*/*/*/_refgenie_build/*running.flag
ll ../genomes/data/*/*/*/_refgenie_build/*completed.flag | wc -l
cat ../genomes/submission/*.log
To run all the asset types:
looper run asset_pep/refgenie_build_cfg.yaml -p bulker_local
Assets are built locally now, but to serve them, we must archive them using refgenieserver
. The general command is refgenieserver archive -c <path/to/genomes.yaml>
. Since the archive process is generally lengthy, it makes sense to submit this job to a cluster. We can use looper to do that.
To start over completely, remove the archive config file with:
rm config/refgenie_config_archive.yaml
Then submit the archiving jobs with looper run
looper run asset_pep/refgenieserver_archive_cfg.yaml -p bulker_local --sel-attr asset --sel-incl fasta
Check progress with:
ll ../genomes/archive_logs/submission/*.log
grep Wait ../genomes/archive_logs/submission/*.log
grep Error ../genomes/archive_logs/submission/*.log
cat ../genomes/archive_logs/submission/*.log
Now the archives should be built, so we'll sync them to AWS. Use the refgenie credentials (here added with --profile refgenie
, which should be preconfigured with aws configure
)
aws s3 sync $REFGENIE_ARCHIVE s3://awspds.refgenie.databio.org/btref.databio.org/ --profile refgenie
Now everything is ready to deploy. If using refgenieserver directly, you'll run refgenieserver serve config/refgenieserver_archive_cfg
. We're hosting this repository on AWS and use GitHub Actions to trigger trigger deploy jobs to push the updates to AWS ECS whenever a change is detected in the config file.
ga -A; gcm "Deploy to ECS"; gpoh