Community-curated list of software packages and data resources for techniques in deep learning for genomics (DL4G). Modeled after awesome-single-cell. Contributions welcome!
- Software packages
- Models
- Datasets and databases
- [Transcriptomic]
- [Epigenomic]
- Interpretation methods
- [Neuron visualizations]
- [Feature attributions]
- [In-silico]
- Tutorials and workflows
- Journal articles of general interest
- Similar lists and collections
- Awesome people
-
TensorFlow - Developed by the Google Brain team (released in 2015). Has a reputation as a well-documented framework with powerful visualization tools (TensorBoard) and an abundance of trained models (TensorFlow Hub). Also known to be complex and to have a steep learning curve. Often used for deploying trained models to production (TensorFlow Serving). Version 2.0 was released in 2019.
-
Keras - An API written in Python to simplify training models. Passes low-level computations to a backend library, which is often TensorFlow.
-
PyTorch - Developed by Facebook AI (released in 2017), has a reputation for simplicity, ease of use, flexibility, efficient memory usage and dynamic computational graphs. Often used for prototyping models and for research.
-
PyTorch Lightning - An API for PyTorch designed to reduce boilerplate code and speed up the prototyping of models.
-
JAX - JAX is Autograd and XLA, brought together for high-performance numerical computing and machine learning research
-
DragoNN - [TensorFlow] - Predictive modeling of regulatory genomics, nucleotide-resolution feature discovery, and simulations for systematic development and benchmarking. (2016)
-
pysster - [TensorFlow] - A Python package for training and interpretation of convolutional neural networks on biological sequence data. (2018)
-
DeepChem - [PyTorch, TensorFlow, JAX] - Open-source toolchain that democratizes the use of deep learning in drug discovery, materials science, quantum chemistry, and biology. (2019)
-
Kipoi - [PyTorch, TensorFlow] - An API and a repository of ready-to-use trained models for genomics. Also allows for usage via the command line or R. (2019)
-
Selene - [PyTorch] - Python library and command line interface for training deep neural networks from biological sequence data such as genomes. (2019)
-
DeepAccess - [TensorFlow] - Training and interpreting CNNs for predicting cell type-specific accessibility. (2021)
-
Janggu - [Keras] - Package that facilitates deep learning in the context of genomics. Janggu provides special genomics datasets and compatibility with NumPy, sklearn, and Keras. (2021)
-
GOPHER - [TensorFlow] - Scripts for data preprocessing, training deep learning models that predict epigenetic function from DNA sequence, and evaluating those models. (2022)
-
ENNGene - [TensorFlow] - An application that simplifies the local training of custom convolutional neural network models on genomic data via an easy-to-use graphical user interface. (2022)
-
EUGENe - [PyTorch Lightning] - An API for running DL4G workflows with sequence-to-function models. Uses SeqData to containerize sequence data and integrates functions for data loading, model training, and model interpretation from several libraries. (2022)
-
Nucleus - Library of Python and C++ code designed to make it easy to read, write and analyze data in common genomics file formats like SAM and VCF.
-
BioPython - Biopython is a set of freely available tools for biological computation written in Python by an international team of developers.
-
scikit-bio - An open-source, BSD-licensed Python package providing data structures, algorithms, and educational resources for bioinformatics.
-
BioNumPy - A Python library for easy and efficient representation and analysis of biological data. (2022)
-
seqgra - A deep learning pipeline that incorporates the rule-based simulation of biological sequence data and the training and evaluation of models
-
kipoiseq - Standard set of data-loaders for training and making predictions for DNA sequence-based models.
-
simdna - This is a tool for generating simulated regulatory sequence for use in experiments/analyses.
-
genome-loader - Pipeline for efficient genomic data processing.
-
PyRanges - GenomicRanges and genomic Rle-objects for Python.
-
BedTools - Swiss-army knife of tools for a wide range of genomics analysis tasks
-
kipoi models - repository hosts predictive models for genomics and serves as a model source for Kipoi
-
HuggingFace Transformers - [PyTorch, TensorFlow, JAX] - Transformers provides APIs and tools to easily download and train state-of-the-art pretrained models. (2021)
-
vizsequence - Collecting commonly-repeated sequence visualization code here. (2019)
-
logomaker - a Python package for generating publication-quality sequence logos. (2019)
-
seqlogo - Python port of Bioconductor's seqLogo served by WebLogo. (2020)
-
TensorBoard - TensorFlow's visualization toolkit
-
Captum - [PyTorch] - General library for model interpretability in PyTorch
-
SHAP - SHapley Additive exPlanations, a game-theoretic approach to explaining the output of any machine learning model
-
TF-MoDISco - Biological motif discovery algorithm that differentiates itself by using attribution scores from a machine learning model.
-
fastISM - [Keras] - Keras implementation for fast in-silico saturated mutagenesis (ISM) for convolution-based architectures
-
yuzu - [PyTorch] - a compressed sensing-based approach that can make in-silico saturation mutagenesis calculations on DNA, RNA, and proteins an order of magnitude faster
-
ExpectedPatternEffect - [TensorFlow] - interpretation of trained DeepAccess models
-
Global importance analysis - Model interpretability with global importance analysis
-
Scrambler - Interpretation method for sequence-predictive models based on deep generative masking
-
DFIM - Epistatic feature interactions from neural network models of regulatory DNA sequence
-
MEME suite - Motif-based sequence analysis tools
-
HOMER - suite of tools for Motif Discovery and next-gen sequencing analysis
-
RayTune - Python library for experiment execution and hyperparameter tuning at any scale
-
DeepBind [paper, PyTorch, EUGENe] - One of the seminal convolution-based architectures, trained to predict the binding of transcription factors and RNA-binding proteins.
-
DeepSEA
-
Basset
-
Basenji
-
ResidualBind
-
DanQ [paper, Keras, Selene, DeepATT, evo_aug] - Trained on the same dataset as DeepSEA to predict binarized epigenomic tracks from ENCODE and Roadmap. Adds a bidirectional LSTM layer after the convolutions and experiments with initializing convolutional filter weights with motifs.
-
DeepMEL
-
DeepFlyBrain
- Enformer
- GTEx
- FANTOM5
- ENCODE
- Roadmap
- RNAcompete - In vitro RNA-binding protein assay of 244 RNA-binding proteins (RBPs). The dataset is downloaded as a single TSV file with RNA probes as rows and RBPs as columns. Each entry in the table is an intensity measurement (normalized or raw) of the binding of one protein to one probe. There are over 244 RBP columns and 241,357 sequences spanning two sets (SetA and SetB).
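
The probes-by-RBPs layout described above can be read with nothing but the Python standard library. The sketch below is a minimal illustration, not the dataset's official loader: the metadata column names (`probe_id`, `set`, `sequence`), the RBP column names, the `NaN` marker for missing values, and the idea of splitting on the set label are all assumptions, and the table is a tiny made-up stand-in for the real file.

```python
import csv
import io

# Tiny stand-in for the real TSV; IDs, column names, and values are made up.
MOCK_TSV = (
    "probe_id\tset\tsequence\tRBP_1\tRBP_2\n"
    "p1\tSetA\tACGUACGU\t1.5\t0.2\n"
    "p2\tSetB\tGGCAUUCA\t-0.3\t2.1\n"
    "p3\tSetA\tUUUUACGA\tNaN\t0.7\n"
)


def load_intensities(handle, meta_cols=("probe_id", "set", "sequence")):
    """Return (probe metadata rows, {rbp_name: [intensity or None per probe]})."""
    reader = csv.DictReader(handle, delimiter="\t")
    # Every column that is not probe metadata is treated as one RBP.
    rbp_cols = [c for c in reader.fieldnames if c not in meta_cols]
    probes = []
    intensities = {c: [] for c in rbp_cols}
    for row in reader:
        probes.append({c: row[c] for c in meta_cols})
        for c in rbp_cols:
            value = float(row[c])
            # NaN is the only float not equal to itself; store it as None.
            intensities[c].append(None if value != value else value)
    return probes, intensities


def split_by_set(probes, values, train_set="SetA"):
    """Pair sequences with intensities, drop missing values, split on the set label."""
    train, test = [], []
    for probe, value in zip(probes, values):
        if value is None:
            continue
        target = train if probe["set"] == train_set else test
        target.append((probe["sequence"], value))
    return train, test


probes, intensities = load_intensities(io.StringIO(MOCK_TSV))
train, test = split_by_set(probes, intensities["RBP_1"])
print(len(train), "training probes;", len(test), "held-out probes")
```

For the real file, replace `io.StringIO(MOCK_TSV)` with an open file handle and adjust `meta_cols` to match the actual header.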