GATK-based Snakemake pipeline for deep mutational scanning experiments

This repository contains the Snakemake-based workflow for implementing deep mutational scanning experiments used in the Fraser and Coyote-Maestas labs.

Briefly, the workflow performs initial QC and mapping with BBTools, then uses the GATK AnalyzeSaturationMutagenesis tool to call variants in each replicate. After variant calling, the variants observed in each read are filtered against the list of designed variants, and the resulting counts are used to infer the fitness of each variant using Rosace and (optionally) Enrich2.

The pipeline is designed to be flexible and modular and should be amenable to use with a variety of experimental designs. However, please note the current limitations described in the Limitations section.

Quick start

git clone https://github.com/odcambc/dumpling
cd dumpling
conda env create --file dumpling_env.yaml
conda activate dumpling_env

Note that on ARM-based Macs the conda environment may fail to install because some required packages are not available for that platform. Assuming that Rosetta 2 (Apple's x86-64 translation layer) is installed, the environments can be installed under emulation with the following commands:

Installation for ARM-based Macs:

CONDA_SUBDIR=osx-64 conda env create --file dumpling_env.yaml
CONDA_SUBDIR=osx-64 conda env create --name enrich2 --file workflow/envs/enrich2.yaml

conda env create --platform osx-64 --name enrich2_arm64

You will also need to set the "samtools_local" variable in the config yaml to "true" to tell the pipeline to use this local version.

If the environment installed and activated properly, edit the configuration files in the config directory as needed. Then run the pipeline with:

snakemake -s workflow/Snakefile --software-deployment-method conda --cores 16

Testing the pipeline with example data

To test the pipeline with example data and examine the output, you can use the files provided in the dumpling-example repository. This repository contains a small dataset and configuration files that can be used to test the pipeline. To use it, clone the repository and move the data directory into the dumpling directory, then create the environment and run the pipeline as above. The repository also includes the output files produced by running the pipeline on the example data, which can be used to compare results.

Installation

Install via GitHub

Download or fork this repository and edit the configuration files as needed.

Installing Rosace

This pipeline uses the Rosace scoring tool. Rosace uses CmdStanR and R to infer scores.

Dumpling uses renv to handle R dependencies. The pipeline also includes a minimal facility for installing Rosace automatically, although it may not work on every system. It can be invoked by calling the install_rosace rule:

snakemake --cores 8 install_rosace

This tries to install renv, restore the renv environment, and install Rosace and CmdStanR. If this fails, please try installing Rosace manually.

We recommend installing Rosace manually before running the pipeline, or at least verifying that the install script works. More details about manually installing Rosace are available in the vignettes of the package and at the repository linked above.

Issues installing Rosace on OSX

Rosace requires C++ and Fortran compilers to build its dependencies. R, by default, requires these to be installed in /opt/gfortran. User installs (via Homebrew, for example) may not work. If you encounter an error compiling packages for Rosace, you may need to install the gfortran compiler provided by the R project.

See https://cran.r-project.org/bin/macosx/tools/ for more details.

Dependencies

Via conda (recommended)

The simplest way to handle dependencies is with Conda and the provided environment file.

conda env create --file dumpling_env.yaml

This will create a new environment named dumpling_env with all the dependencies installed. Then simply activate the environment and you're ready to go.

conda activate dumpling_env

Manually

The following are the dependencies required to run the pipeline:

Configuration

Configuration files

The details of an experiment need to be specified in a configuration file that defines parameters and an associated experiment file that details the experimental setup.

The configuration file is a YAML file: full details are included in the example file config/test_config.yaml and in the schema file schemas/config.schema.yaml.

The experiment file is a CSV file that relates experimental conditions, replicates, and time points to sequencing files: full details are included in the config file and in the schema file schemas/experiments.schema.yaml.

Additionally, a reference fasta file is required for mapping. This should be placed in the references directory, and the path to the file should be specified in the config file.

This pipeline also includes a processing step that standardizes variant nomenclature and removes any variants that were not designed or are likely errors. This requires a CSV file containing the set of designed variants, including their specific codon changes. Place this file in the config/designed_variants directory and specify its path in the config file. An example file is included in config/designed_variants/test_variants.csv. The pipeline can also generate the variants CSV from the set of oligos produced by the DIMPLE library generation protocol: to enable this, include the path to the oligo CSV file in the config file and set regenerate_variants to True.
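The filtering against designed variants can be pictured as a set-membership check. The following is a minimal sketch under stated assumptions, not the pipeline's actual implementation: the column names (`variant`, `count`), the toy data, and the helper `filter_to_designed` are all hypothetical, and the real file layouts are defined by the example files and the workflow scripts.

```python
import csv
import io


def filter_to_designed(observed_rows, designed_names):
    """Split observed variant counts into designed and rejected groups.

    observed_rows: iterable of dicts with hypothetical keys "variant" and "count".
    designed_names: set of variant names read from the designed variants CSV.
    """
    kept, rejected = {}, {}
    for row in observed_rows:
        target = kept if row["variant"] in designed_names else rejected
        # Sum counts in case the same variant appears on more than one row.
        target[row["variant"]] = target.get(row["variant"], 0) + int(row["count"])
    return kept, rejected


# Toy in-memory stand-ins for the two CSV files (illustrative data only).
designed_csv = io.StringIO("variant\nA10G\nA10del\n")
observed_csv = io.StringIO("variant,count\nA10G,120\nA10W,3\nA10del,45\nA10G,5\n")

designed = {row["variant"] for row in csv.DictReader(designed_csv)}
kept, rejected = filter_to_designed(csv.DictReader(observed_csv), designed)

print(kept)      # counts for variants present in the designed set
print(rejected)  # counts for variants absent from the designed set
```

Keeping the rejected counts, rather than discarding them, makes it easy to see how much of the sequencing depth falls outside the designed library.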

Working directory structure

The pipeline has the following directory structure:

├── workflow
│   ├── rules
│   ├── envs
│   ├── scripts
│   └── Snakefile
├── config
│   ├── test_config.yaml
│   ├── test_config.csv
│   ├── designed_variants
│   │   └── test_variants.csv
│   └── oligos
│       └── test_oligos.csv
├── logs
│   └── ...
├── references
│   └── test_ref.fasta
├── results
│   └── ...
├── schemas
│   ├── config.schema.yaml
│   └── experiments.schema.yaml
├── stats
│   └── ...
├── resources
│   ├── adapters.fa
│   ├── sequencing_artifacts.fa.gz
│   └── ...
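
Before starting a run, it can help to confirm that the input side of this layout is in place. The snippet below is a small sketch: the choice of which entries to treat as required is an assumption made for illustration, and `missing_paths` is a hypothetical helper, not part of the pipeline.

```python
import tempfile
from pathlib import Path

# Entries from the directory tree that hold inputs rather than generated output.
# Treating exactly these as required is an assumption for this sketch.
REQUIRED = ["workflow/Snakefile", "config", "references", "schemas", "resources"]


def missing_paths(root, required=REQUIRED):
    """Return the entries of `required` that do not exist under `root`."""
    root = Path(root)
    return [p for p in required if not (root / p).exists()]


# Demonstration on a throwaway directory that has only part of the layout.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "workflow").mkdir()
    (Path(tmp) / "workflow" / "Snakefile").touch()
    (Path(tmp) / "config").mkdir()
    demo_missing = missing_paths(tmp)

print(demo_missing)
```

In a real checkout, calling `missing_paths(".")` from the repository root would list anything that still needs to be created.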

Usage

We normally use one instance of the pipeline for each experiment. This allows for simpler tracking and reproducibility of individual experiments: for a new dataset, fork the repo, edit the configuration files, and run the pipeline. This way, a record of the exact configuration and environment can be saved. It is possible to run multiple experiments in the same folder, but this is more difficult to reproduce.

Running the pipeline

Once the dependencies have been installed (whether via conda or otherwise) the pipeline can be run with the following command:

snakemake -s workflow/Snakefile --software-deployment-method conda --cores 8

The maximum number of cores can be specified with the --cores flag. The --software-deployment-method conda flag tells Snakemake to use conda to create the environment specified within each rule.
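
When runs are launched from a script, the command can be assembled programmatically. This is a minimal sketch that only reuses the flags shown above; the helper name `snakemake_command` and its parameters are hypothetical.

```python
import shlex


def snakemake_command(cores, snakefile="workflow/Snakefile", use_conda=True):
    """Build the Snakemake invocation as an argument list."""
    if cores < 1:
        raise ValueError("cores must be a positive integer")
    cmd = ["snakemake", "-s", snakefile]
    if use_conda:
        # Ask Snakemake to create the conda environment defined in each rule.
        cmd += ["--software-deployment-method", "conda"]
    cmd += ["--cores", str(cores)]
    return cmd


cmd = snakemake_command(8)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from the repository root with the dumpling_env environment active.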

Output files

The pipeline generates a variety of output files. These are organized into the following directories:

  • benchmarks: details of the runtime and process usage for each rule
  • logs: log files from each rule
  • results: outputs from each rule (Note: many of these are intermediate files and are deleted by default).
  • stats: various processing statistics from each rule
  • ref: mapping target files generated by BBTools

These are ignored by git by default.

Analyzing results

QC metrics

A variety of stats from tool outputs are provided in the stats directory. These are aggregated using MultiQC. The aggregated reports contain:

  • FastQC reports for raw reads (read counts, base quality, adapter content, etc.)
  • BBTools reports
    • BBDuk reports for adapter trimming and contamination removal
    • BBMerge reports for merging paired-end reads
    • BBMap reports for mapping reads to the reference
  • GATK AnalyzeSaturationMutagenesis reports for variant calling
  • Reports for variant filtering

If a baseline condition is defined, a separate baseline report is also generated.

The files are saved as stats/{experiment_name}_multiqc_report.html and stats/{experiment_name}_baseline_multiqc_report.html by default.
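
For downstream scripts, the default report locations follow directly from the naming pattern above. A small sketch, assuming the default stats directory; the helper `multiqc_report_paths` and the experiment name `test` are hypothetical.

```python
from pathlib import Path


def multiqc_report_paths(experiment_name, stats_dir="stats"):
    """Return the default MultiQC report paths for one experiment."""
    stats = Path(stats_dir)
    return {
        "main": stats / f"{experiment_name}_multiqc_report.html",
        # Only generated when a baseline condition is defined.
        "baseline": stats / f"{experiment_name}_baseline_multiqc_report.html",
    }


paths = multiqc_report_paths("test")
for label, path in paths.items():
    print(label, path.as_posix(), "exists" if path.exists() else "not found")
```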

Data analysis

A starting analysis and plotting workflow is available in an associated repository: https://github.com/odcambc/dms_analysis_stub

Limitations

We aim to regularly update this pipeline and continually expand its functionality. However, there are currently several known limitations.

  • The pipeline is currently designed for short-read sequencing. It does not support long-read PacBio or Nanopore sequencing.
  • The pipeline is currently designed for direct sequencing. It does not support barcoded sequencing.
  • The pipeline is currently designed for single-site variants, including indels of varying length. It largely does not support combinatorial variants.
  • The designed variant generation step is currently optimized for DIMPLE libraries. For other protocols, users may need to generate the designed variants CSV themselves.
  • Rosace is designed for growth-based experiments. It is not optimized for FACS-seq experiments.
  • This pipeline may not work properly if the data is stored on a cloud server (e.g., a Box drive) or another non-standard file system.

Citations

This workflow is described in the following publication:

License

This is licensed under the MIT license. See the LICENSE file for details.

Contributing

Contributions and feedback are welcome. Please submit an issue or pull request.

Getting help

For any issues, please open an issue on the GitHub repository. For questions or feedback, email Chris.