Project aims:
- Facilitate the use of deep-learning-based biological sequence representations for transfer learning by providing a single, consistent interface with close to zero friction
- Reproducible workflows
- Depth of representation (different models from different labs trained on different datasets for different purposes)
- Extensive examples, handling of complexity for users (e.g. CUDA OOM abstraction), and well-documented warnings and error messages
The project includes:
- General-purpose Python embedders based on open models trained on biological sequence representations (SeqVec, ProtTrans, UniRep, ...)
- A pipeline which:
  - embeds sequences into matrix representations (per-amino-acid) or vector representations (per-sequence) that can be used to train learning models or for analytical purposes
  - projects per-sequence embeddings into lower-dimensional representations using UMAP or t-SNE (for lightweight data handling and visualizations)
  - visualizes low-dimensional sets of per-sequence embeddings on 2D and 3D interactive plots (with and without annotations)
  - extracts annotations from per-sequence and per-amino-acid embeddings using supervised (when available) and unsupervised approaches (e.g. by network analysis)
- A webserver that wraps the pipeline into a distributed API for scalable and consistent workflows
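The difference between the per-amino-acid (matrix) and per-sequence (vector) representations listed above can be illustrated with a small, dependency-free sketch. It assumes mean pooling over residues, which is one common way to obtain a fixed-size vector; the reduction actually used by a given embedder may differ. The function name `mean_pool` and the toy numbers are illustrative and not taken from the bio_embeddings code.

```python
from typing import List


def mean_pool(per_residue: List[List[float]]) -> List[float]:
    """Average an L x D per-residue matrix into a single D-dimensional vector."""
    if not per_residue:
        raise ValueError("cannot pool an empty embedding")
    length = len(per_residue)
    dim = len(per_residue[0])
    # Average each embedding dimension over all residues of the sequence.
    return [sum(row[d] for row in per_residue) / length for d in range(dim)]


# Toy per-amino-acid embedding: 3 residues, 2 dimensions each (made-up values).
per_residue = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
per_sequence = mean_pool(per_residue)
print(per_sequence)
```

A per-sequence vector has the same size regardless of sequence length, which is what makes it convenient as input for projections and distance-based comparisons.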
We presented the bio_embeddings pipeline as a talk at ISMB 2020. You can find the talk on YouTube, and the poster on F1000.
- Integrated Evolutionary Scale Modeling (ESM) from "Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences" (Rives et al., 2019)
- Included an example to transfer GO annotations (à la goPredSim). We also make the reference annotations and embeddings available!
- We've added the language models ESM, PLUS and CPCProt
You can install bio_embeddings via pip or use it via docker.
Install the pipeline like so:
pip install bio-embeddings[all]
To get the latest features, please install the pipeline like so:
pip install -U "bio-embeddings[all] @ git+https://github.com/sacdallago/bio_embeddings.git"
We provide a docker image at rostlab/bio_embeddings. Simple usage example:
docker run --rm --gpus all \
-v "$(pwd)/examples/docker":/mnt \
-u $(id -u ${USER}):$(id -g ${USER}) \
rostlab/bio_embeddings /mnt/config.yml
See the docker example in the examples folder for instructions. We have currently published rostlab/bio_embeddings:develop. For our next stable release, we will publish tags for all releases and a latest tag pointing to the latest release.
bio_embeddings was developed for Unix machines with GPU capabilities and CUDA installed. If your setup diverges from this, you may encounter some inconsistencies (e.g. speed is significantly affected by the absence of a GPU and CUDA). For Windows users, we strongly recommend the use of Windows Subsystem for Linux.
Each model has its strengths and weaknesses (speed, specificity, memory footprint, ...). There is no one-size-fits-all model, and we encourage you to try at least two different models when starting a new exploratory project.
The models prottrans_bert_bfd, prottrans_albert_bfd, seqvec and prottrans_xlnet_uniref100 were all trained with the goal of systematic predictions. From this pool, we believe the optimal model to be prottrans_bert_bfd, followed by seqvec, which has been established for longer and uses a different principle (LSTM vs Transformer).
We highly recommend that you check out the examples folder for pipeline examples, and the notebooks folder for post-processing pipeline runs and general-purpose use of the embedders.
After having installed the package, you can:
- Use the pipeline like:
  bio_embeddings config.yml
  A blueprint of the configuration file and an example setup can be found in the examples directory of this repository.
- Use the general purpose embedder objects via python, e.g.:
  from bio_embeddings.embed import SeqVecEmbedder
  embedder = SeqVecEmbedder()
  embedding = embedder.embed("SEQVENCE")
  More examples can be found in the notebooks folder of this repository.
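Because the embedder objects are meant to share a single, consistent interface, downstream code can be written against the `embed` call shown above rather than against a specific model. The following is a minimal sketch of that idea using a stand-in class, so it runs without installing the package or downloading model weights. `ToyEmbedder` and `embed_all` are hypothetical names invented for this illustration and are not part of bio_embeddings; the two toy features carry no biological meaning.

```python
from typing import Dict, List


class ToyEmbedder:
    """Hypothetical stand-in for a real embedder: returns one 2-d row per residue."""

    HYDROPHOBIC = set("AILMFVWY")

    def embed(self, sequence: str) -> List[List[float]]:
        # Row = [hydrophobic flag, alphabet index]; purely illustrative features.
        return [
            [1.0 if aa in self.HYDROPHOBIC else 0.0, float(ord(aa) - ord("A"))]
            for aa in sequence
        ]


def embed_all(embedder, sequences: Dict[str, str]) -> Dict[str, List[List[float]]]:
    """Embed every sequence with any object that offers embed(sequence)."""
    return {name: embedder.embed(seq) for name, seq in sequences.items()}


sequences = {"seq1": "MKV", "seq2": "AG"}
embeddings = embed_all(ToyEmbedder(), sequences)
for name, matrix in embeddings.items():
    print(name, len(matrix), "residues x", len(matrix[0]), "dims")
```

A real embedder such as the `SeqVecEmbedder` from the snippet above could be passed in place of `ToyEmbedder()`; the shape of what it returns depends on the model, so check the notebooks before relying on a particular layout.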
While we are working on a proper publication, if you are already using this tool, we would appreciate it if you cited the following poster:
Dallago C, Schütze K, Heinzinger M et al. bio_embeddings: python pipeline for fast visualization of protein features extracted by language models [version 1; not peer reviewed]. F1000Research 2020, 9(ISCB Comm J):876 (poster) (doi: 10.7490/f1000research.1118163.1)
- Christian Dallago (lead)
- Konstantin Schütze
- Tobias Olenyi
- Michael Heinzinger
Pipeline stages
- embed:
  - ProtTrans BERT trained on BFD (https://doi.org/10.1101/2020.07.12.199554)
  - SeqVec (https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-3220-8)
  - ProtTrans ALBERT trained on BFD (https://doi.org/10.1101/2020.07.12.199554)
  - ProtTrans XLNet trained on UniRef100 (https://doi.org/10.1101/2020.07.12.199554)
  - FastText
  - GloVe
  - Word2Vec
  - UniRep (https://www.nature.com/articles/s41592-019-0598-1)
  - ESM (https://www.biorxiv.org/content/10.1101/622803v3)
  - PLUS (https://github.com/mswzeus/PLUS/)
  - CPCProt (https://www.biorxiv.org/content/10.1101/2020.09.04.283929v1.full.pdf)
- project:
  - t-SNE
  - UMAP
- visualize:
  - 2D/3D sequence embedding space
- extract:
  - supervised:
    - SeqVec: DSSP3, DSSP8, disorder, subcellular location and membrane-boundness as in https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-3220-8
    - Bert: DSSP3, DSSP8, disorder, subcellular location and membrane-boundness as in https://doi.org/10.1101/2020.07.12.199554
  - unsupervised:
    - via sequence-level (reduced_embeddings), pairwise distance (euclidean like goPredSim, more options available, e.g. cosine)
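The unsupervised extraction listed above relies on pairwise distances between per-sequence embeddings. A minimal sketch of the idea follows, assuming a nearest-neighbour lookup: the annotation of the closest reference embedding is transferred to the query. The function names, the toy vectors and the reference labels are made up for illustration; this is not the pipeline's implementation.

```python
import math
from typing import Callable, Dict, List, Tuple

Vector = List[float]


def euclidean(a: Vector, b: Vector) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a: Vector, b: Vector) -> float:
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine distance is undefined for zero vectors")
    dot = sum(x * y for x, y in zip(a, b))
    return 1.0 - dot / (norm_a * norm_b)


def transfer_annotation(
    query: Vector,
    references: Dict[str, Vector],
    annotations: Dict[str, str],
    metric: Callable[[Vector, Vector], float] = euclidean,
) -> Tuple[str, str, float]:
    """Return (reference id, its annotation, distance) for the closest reference."""
    nearest = min(references, key=lambda name: metric(query, references[name]))
    return nearest, annotations[nearest], metric(query, references[nearest])


# Toy per-sequence embeddings and labels (made-up values).
references = {"refA": [1.0, 0.0], "refB": [3.0, 4.0]}
annotations = {"refA": "annotation of refA", "refB": "annotation of refB"}
query = [2.5, 3.5]

print(transfer_annotation(query, references, annotations))
print(transfer_annotation(query, references, annotations, metric=cosine_distance))
```

Passing the metric as a parameter keeps the lookup unchanged when switching between euclidean and cosine distance. In practice the returned distance can also serve as a rough confidence signal, since a distant nearest neighbour is a weaker basis for transfer.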
Web server (unpublished)
- SeqVec supervised predictions
- Bert supervised predictions
- SeqVec unsupervised predictions for GO: CC, BP,..
- Bert unsupervised predictions for GO: CC, BP,..
- SeqVec unsupervised predictions for SwissProt (just a link to the 1st-k-nn)
- Bert unsupervised predictions for SwissProt (just a link to the 1st-k-nn)
General purpose embedders
- ProtTrans BERT trained on BFD (https://doi.org/10.1101/2020.07.12.199554)
- SeqVec (https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-3220-8)
- ProtTrans ALBERT trained on BFD (https://doi.org/10.1101/2020.07.12.199554)
- ProtTrans XLNet trained on UniRef100 (https://doi.org/10.1101/2020.07.12.199554)
- FastText
- GloVe
- Word2Vec
- UniRep (https://www.nature.com/articles/s41592-019-0598-1)
- ESM (https://www.biorxiv.org/content/10.1101/622803v3)
- PLUS (https://github.com/mswzeus/PLUS/)
- CPCProt (https://www.biorxiv.org/content/10.1101/2020.09.04.283929v1.full.pdf)
The packages are best built using invoke.
If you manage your dependencies with poetry, this should already be installed.
Simply use poetry run invoke clean build to update your requirements according to your current status and to generate the dist files.