(Soon to be) community-curated list of software packages and data resources for deep learning for genomics (DL4G)

ML4GLand/awesome-dl4g

awesome-dl4g

Community-curated list of software packages and data resources for techniques in deep learning for genomics (DL4G). Modeled after awesome-single-cell. Contributions welcome!

Contents

Software packages

Deep learning frameworks

  • TensorFlow - Developed by the Google Brain team (released in 2015). Has a reputation as a well-documented framework with powerful visualization tools (TensorBoard) and an abundance of trained models (TensorFlow Hub), but is also known to be complex with a steep learning curve. Often used for deploying trained models to production (TensorFlow Serving). Version 2.0 was released in 2019.

  • Keras - An API written in Python to simplify training models. Passes low-level computations to a backend library, which is often TensorFlow.

  • PyTorch - Developed by Facebook AI (released in 2017), has a reputation for simplicity, ease of use, flexibility, efficient memory usage and dynamic computational graphs. Often used for prototyping models and for research.

  • PyTorch Lightning - An API for PyTorch designed to reduce boilerplate code and speed up the prototyping of models.

  • JAX - JAX is Autograd and XLA, brought together for high-performance numerical computing and machine learning research

DL4G Packages

  • DragoNN - [TensorFlow] - Predictive modeling of regulatory genomics, nucleotide-resolution feature discovery, and simulations for systematic development and benchmarking. (2016)

  • pysster - [TensorFlow] - A Python package for training and interpretation of convolutional neural networks on biological sequence data. (2018)

  • DeepChem - [PyTorch, TensorFlow, JAX] - Open-source toolchain that democratizes the use of deep learning in drug discovery, materials science, quantum chemistry, and biology. (2019)

  • Kipoi - [PyTorch, TensorFlow] - An API and a repository of ready-to-use trained models for genomics. Also allows for usage via the command line or R. (2019)

  • Selene - [PyTorch] - Python library and command line interface for training deep neural networks from biological sequence data such as genomes. (2019)

  • DeepAccess - [TensorFlow] - Training and interpreting CNNs for predicting cell type-specific accessibility. (2021)

  • Janggu - [Keras] - Package that facilitates deep learning in the context of genomics. Janggu provides special genomics datasets and compatibility with NumPy, sklearn, and Keras. (2021)

  • GOPHER - [TensorFlow] - Scripts for data preprocessing, training deep learning models that predict epigenetic function from DNA sequence, and evaluating those models. (2022)

  • ENNGene - [TensorFlow] - An application that simplifies the local training of custom convolutional neural network models on genomic data via an easy-to-use graphical user interface. (2022)

  • EUGENe - [PyTorch Lightning] - An API for running DL4G workflows with sequence-to-function models. Uses SeqData to containerize sequence data and integrates functions for data loading, model training, and model interpretation from several libraries. (2022)

Data wrangling

  • Nucleus - Library of Python and C++ code designed to make it easy to read, write and analyze data in common genomics file formats like SAM and VCF.

  • BioPython - Biopython is a set of freely available tools for biological computation written in Python by an international team of developers.

  • scikit-bio - An open-source, BSD-licensed Python package providing data structures, algorithms, and educational resources for bioinformatics.

  • BioNumPy - A Python library for easy and efficient representation and analysis of biological data. (2022)

  • seqgra - A deep learning pipeline that incorporates the rule-based simulation of biological sequence data and the training and evaluation of models

  • kipoiseq - Standard set of data-loaders for training and making predictions for DNA sequence-based models.

  • simdna - This is a tool for generating simulated regulatory sequence for use in experiments/analyses.

  • genome-loader - Pipeline for efficient genomic data processing.

  • PyRanges - GenomicRanges and genomic Rle-objects for Python.

  • BedTools - Swiss-army knife of tools for a wide range of genomics analysis tasks.

Model zoos

  • kipoi models - Repository that hosts predictive models for genomics and serves as the model source for Kipoi.

  • HuggingFace Transformers - [PyTorch, TensorFlow, JAX] - Transformers provides APIs and tools to easily download and train state-of-the-art pretrained models. (2021)

Visualizations

  • vizsequence - Collecting commonly-repeated sequence visualization code here. (2019)

  • logomaker - a Python package for generating publication-quality sequence logos. (2019)

  • seqlogo - Python port of Bioconductor's seqLogo served by WebLogo. (2020)

  • TensorBoard - TensorFlow's visualization toolkit

Interpretability

  • Captum - [PyTorch] - General library for model interpretability in PyTorch

  • SHAP - SHapley Additive exPlanations, a game-theoretic approach to explaining the output of any machine learning model.

  • TF-MoDISco - Biological motif discovery algorithm that differentiates itself by using attribution scores from a machine learning model.

  • fastISM - [Keras] - Keras implementation of fast in-silico saturation mutagenesis (ISM) for convolution-based architectures.

  • yuzu - [PyTorch] - A compressed sensing-based approach that can make in-silico saturation mutagenesis calculations on DNA, RNA, and proteins an order of magnitude faster.

  • ExpectedPatternEffect - [TensorFlow] - interpretation of trained DeepAccess models

  • Global importance analysis - Model interpretability with global importance analysis.

  • Scrambler - Interpretation method for sequence-predictive models based on deep generative masking

  • DFIM - Epistatic feature interactions from neural network models of regulatory DNA sequence
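Several entries above (fastISM, yuzu) accelerate in-silico saturation mutagenesis (ISM). For orientation, the naive procedure they speed up can be sketched as follows. This is a minimal illustration, not the API of any listed package: `toy_score` is a hypothetical stand-in for a trained model, and `naive_ism` is an illustrative name.

```python
from typing import Callable, Dict, Tuple

ALPHABET = "ACGT"


def toy_score(seq: str) -> float:
    """Hypothetical stand-in for a trained model: counts 'TATA' occurrences."""
    return float(sum(seq[i:i + 4] == "TATA" for i in range(len(seq) - 3)))


def naive_ism(seq: str, score_fn: Callable[[str], float]) -> Dict[Tuple[int, str], float]:
    """Score every single-nucleotide substitution relative to the reference."""
    ref_score = score_fn(seq)
    effects = {}
    for pos, ref_base in enumerate(seq):
        for alt in ALPHABET:
            if alt == ref_base:
                continue  # skip the reference base itself
            mutated = seq[:pos] + alt + seq[pos + 1:]
            # Effect of the substitution = mutated score minus reference score
            effects[(pos, alt)] = score_fn(mutated) - ref_score
    return effects


effects = naive_ism("GGTATAGG", toy_score)
```

The naive loop calls the model 3 × L times for a sequence of length L, which is the cost that the faster implementations listed above aim to reduce.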

Utilities

  • MEME suite - Motif-based sequence analysis tools

  • HOMER - suite of tools for Motif Discovery and next-gen sequencing analysis

  • RayTune - Python library for experiment execution and hyperparameter tuning at any scale

Models

Convolutional

  • DeepBind [paper, PyTorch, EUGENe] - One of the seminal convolution-based architectures, trained to predict the binding of transcription factors and RNA-binding proteins.

  • DeepSEA

  • Basset

  • Basenji

  • ResidualBind

Recurrent

Hybrid

  • DanQ [paper, Keras, Selene, DeepATT, evo_aug] - Trained on the same dataset as DeepSEA to predict binarized epigenomic tracks from ENCODE and Roadmap. Adds a bidirectional LSTM layer after the convolutions, and the authors experimented with initializing convolutional filter weights with motifs.

  • DeepMEL

  • DeepFlyBrain

Autoencoder

Transformer

  • Enformer

Generative

Datasets and databases

Transcriptomic

  • GTEx
  • FANTOM5

Epigenomic

  • ENCODE
  • Roadmap

Chemoinformatics

Single cell

RNA binding

  • RNAcompete - In vitro RNA-binding protein assay of 244 RNA-binding proteins (RBPs). The dataset is downloaded as a single TSV file with RNA probes as rows and RBPs as columns. Each entry in the table is an intensity measurement (normalized or raw) of the binding of one protein to one probe. There are over 244 RBP columns and 241,357 sequences spanning two sets (SetA and SetB).
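A table with the layout described above (probes as rows, RBPs as columns) can be read with the Python standard library alone. The sketch below runs on a tiny made-up table: the column names (`probe_id`, `sequence`, `RBP_1`, `RBP_2`), the sequences, and the intensities are placeholders and are not taken from the real file, whose headers should be checked before reuse.

```python
import csv
import io

# Made-up miniature table in the described layout: one row per RNA probe,
# one intensity column per RBP. All names and values are placeholders.
toy_tsv = (
    "probe_id\tsequence\tRBP_1\tRBP_2\n"
    "p1\tACGUACGU\t1.5\t-0.2\n"
    "p2\tGGGAUUCC\t0.3\t2.1\n"
)


def read_intensities(handle, meta_cols=("probe_id", "sequence")):
    """Return {rbp_name: [(sequence, intensity), ...]} from a probes-by-RBP table."""
    reader = csv.DictReader(handle, delimiter="\t")
    per_rbp = {}
    for row in reader:
        for col, value in row.items():
            if col in meta_cols:
                continue  # keep only the per-RBP intensity columns
            per_rbp.setdefault(col, []).append((row["sequence"], float(value)))
    return per_rbp


per_rbp = read_intensities(io.StringIO(toy_tsv))
```

Regrouping by RBP gives one (sequence, intensity) list per protein, which is a convenient shape for training one regression model per RBP. Missing measurements, if present in the real file, would need explicit handling before the `float` conversion.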

Tutorials and workflows

Journal articles of general interest

Paper collections

Similar lists and collections

Awesome people
