APPT: Affinity Protein-Protein Transformer

Overview

APPT (Affinity Protein-Protein Transformer) is a transformer-based model that predicts protein-protein binding affinity from pairs of protein sequences. It builds on embeddings from the ESM protein language model and is designed to support drug discovery, protein engineering, and biological research.

Features

Transformer-Based Architecture: Utilizes attention mechanisms for capturing long-range dependencies in protein-protein interactions
ESM Integration: Leverages the ESM protein language model for high-quality protein sequence embeddings
Efficient Caching: Implements smart caching of protein embeddings to speed up repeated predictions
Hyperparameter Optimization: Includes Optuna-based hyperparameter optimization
Command-Line Interface: Easy-to-use CLI for making predictions
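
The caching idea can be illustrated with a minimal sketch. This is not APPT's implementation: `EmbeddingCache` and `fake_embed` are hypothetical names, and the stand-in embedding function returns a two-element list where the real model would return a 2560-dimensional ESM embedding.

```python
from typing import Callable, Dict, List


class EmbeddingCache:
    """Memoize per-sequence embeddings so each unique sequence is embedded once."""

    def __init__(self, embed_fn: Callable[[str], List[float]]) -> None:
        self._embed_fn = embed_fn
        self._store: Dict[str, List[float]] = {}
        self.misses = 0  # how many times the expensive function actually ran

    def get(self, sequence: str) -> List[float]:
        # Normalize so that "acd" and "ACD" share one cache entry.
        key = sequence.strip().upper()
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._embed_fn(key)
        return self._store[key]


def fake_embed(sequence: str) -> List[float]:
    # Hypothetical stand-in for an expensive language-model forward pass.
    return [float(len(sequence)), float(sequence.count("A"))]


cache = EmbeddingCache(fake_embed)
pairs = [("ACDA", "KLM"), ("ACDA", "PQRS"), ("KLM", "ACDA")]
embedded = [(cache.get(a), cache.get(b)) for a, b in pairs]
print(cache.misses)  # 3: six lookups, but only three unique sequences
```

Keying the cache on the sequence itself means a protein that appears in many pairs is embedded only once, which is where the speed-up for repeated predictions would come from.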

Installation

For extra help with installation, please see our Bindwell AI.

Clone the repository:

```
git clone https://github.com/Bindwell/APPT.git
cd APPT
```

Install dependencies and model files:

```
pip install -r requirements.txt
# Create the target directories if they do not exist yet
mkdir -p embedding_cache_2560 models
# Download the cached embeddings and the trained model checkpoint
wget -P embedding_cache_2560 https://huggingface.co/Bindwell/APPT/resolve/main/caches.pt
wget -P models https://huggingface.co/Bindwell/APPT/resolve/main/protein_protein_affinity_esm_vs_ankh_best.pt
```

Or, if you prefer conda:

```
conda env create -f environment.yml
conda activate bindwell_appt
```
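
Before running inference, it can help to confirm that the downloaded files are where the CLI expects them. The following is a small sketch, not part of the repository: `missing_files` is a hypothetical helper, and the two paths are taken from the download commands in this section.

```python
from pathlib import Path
from typing import List

# Paths taken from the download commands, relative to the repository root.
EXPECTED_FILES = [
    "embedding_cache_2560/caches.pt",
    "models/protein_protein_affinity_esm_vs_ankh_best.pt",
]


def missing_files(repo_root: str = ".") -> List[str]:
    """Return the expected files that are absent or empty under repo_root."""
    root = Path(repo_root)
    missing = []
    for rel in EXPECTED_FILES:
        path = root / rel
        # An empty file usually means an interrupted download.
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(rel)
    return missing


if __name__ == "__main__":
    gaps = missing_files()
    print("All model files present." if not gaps else f"Missing: {gaps}")
```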

Run Inference
```
python cli.py --sequences ABC DEF
```

Docker

You can also run the model using Docker:

Build the Docker image locally:
```
docker build -t appt .
```
Or pull the pre-built image from Docker Hub:
```
docker pull cford38/appt:latest
```

Run the Docker container:

```
docker run --gpus all --rm --name appt -it appt /bin/bash
# With the pre-built image instead:
# docker run --gpus all --rm --name appt -it cford38/appt:latest /bin/bash
```

Performance Evaluation

APPT outperforms the existing methods it was compared against on each of the metrics below:

Mean Square Error Comparison

APPT achieves a mean squared error (MSE) of 2.868, compared with 4.871 for PPB-Affinity. Lower values indicate better prediction accuracy.

Correlation Analysis

APPT (labeled as "OURS") achieves the highest correlation, R = 0.79, ahead of the other methods:

- PPB-Affinity: 0.67
- Specific Geometry: 0.53
- ASA and RMSD: 0.31

Comprehensive Method Comparison

In terms of R², APPT ("OURS") achieves 0.68, well ahead of the other approaches:

- AffinityScore: 0.25
- Rosetta: 0.27
- PyDock: 0.28
- Other methods: < 0.20
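
For readers who want to compute the same kinds of metrics on their own predictions, here is a minimal sketch. The README does not state which definitions of R and R² were used for the figures above, so the code assumes the standard Pearson correlation and the coefficient of determination. The pKd values are made up for illustration.

```python
import math
from typing import Sequence


def mse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean squared error: average of squared residuals."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def pearson_r(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Pearson correlation coefficient between measurements and predictions."""
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    return cov / math.sqrt(var_t * var_p)


def r_squared(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_t = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Made-up pKd values, for illustration only.
measured = [6.0, 7.0, 8.0, 9.0]
predicted = [6.5, 6.5, 8.5, 8.5]

print(f"MSE = {mse(measured, predicted):.3f}")
print(f"R   = {pearson_r(measured, predicted):.3f}")
print(f"R^2 = {r_squared(measured, predicted):.3f}")
```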

Model Architecture

APPT incorporates a transformer-based model with the following components:

Base Architecture

Protein embedding using ESM model (2560-dimensional embeddings)
Projection layer to transform embeddings
Multi-head self-attention layers
Feed-forward prediction head

Default Hyperparameters

Input Dimension: 2560 (ESM embedding size)
Embedding Dimension: 384
Linear Dimension: 160
Number of Attention Layers: 4
Attention Heads: 4
Batch Size: 16
Dropout Rate: 0.1
Learning Rate: 6.30288565853412e-05
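
One way to keep these defaults together in code is a small configuration object. The sketch below is an assumption rather than the structure used in index.py: `APPTConfig` is a hypothetical name, and the divisibility check reflects how standard multi-head attention splits the embedding evenly across heads.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class APPTConfig:
    # Defaults copied from the list above; the class itself is hypothetical.
    input_dim: int = 2560
    embedding_dim: int = 384
    linear_dim: int = 160
    num_attention_layers: int = 4
    num_heads: int = 4
    batch_size: int = 16
    dropout_rate: float = 0.1
    learning_rate: float = 6.30288565853412e-05

    def __post_init__(self) -> None:
        # Standard multi-head attention gives each head an equal slice.
        if self.embedding_dim % self.num_heads != 0:
            raise ValueError("embedding_dim must be divisible by num_heads")
        if not 0.0 <= self.dropout_rate < 1.0:
            raise ValueError("dropout_rate must be in [0, 1)")

    @property
    def head_dim(self) -> int:
        """Width of each attention head."""
        return self.embedding_dim // self.num_heads


config = APPTConfig()
print(config.head_dim)  # 384 // 4 = 96
```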

Training

Custom dataset

To train APPT on your dataset:

Prepare your dataset in CSV format with columns:
- protein1_sequence
- protein2_sequence
- pkd (binding affinity)
Replace Data.csv, the dataset that index.py reads, with your own file (or update the path in index.py)
Run the training script:
```
python index.py
```
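
If your measurements are dissociation constants rather than pKd values, they need converting first: pKd is the negative base-10 logarithm of Kd expressed in mol/L. The sketch below writes and checks a CSV with the three columns listed above. The helper names are hypothetical and are not part of index.py.

```python
import csv
import math
from typing import Iterable, Tuple

COLUMNS = ["protein1_sequence", "protein2_sequence", "pkd"]


def kd_nanomolar_to_pkd(kd_nm: float) -> float:
    """pKd = -log10(Kd in mol/L); 1 nM = 1e-9 mol/L."""
    if kd_nm <= 0:
        raise ValueError("Kd must be positive")
    return -math.log10(kd_nm * 1e-9)


def write_training_csv(path: str, rows: Iterable[Tuple[str, str, float]]) -> None:
    """Write (sequence1, sequence2, Kd in nM) rows using the expected header."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(COLUMNS)
        for seq1, seq2, kd_nm in rows:
            writer.writerow([seq1, seq2, f"{kd_nanomolar_to_pkd(kd_nm):.3f}"])


def validate_training_csv(path: str) -> int:
    """Check the header and every row; return the number of data rows."""
    with open(path, newline="") as handle:
        reader = csv.DictReader(handle)
        missing = [c for c in COLUMNS if c not in (reader.fieldnames or [])]
        if missing:
            raise ValueError(f"missing columns: {missing}")
        count = 0
        for row in reader:
            if not row["protein1_sequence"] or not row["protein2_sequence"]:
                raise ValueError(f"empty sequence in data row {count + 1}")
            float(row["pkd"])  # raises ValueError if pkd is not numeric
            count += 1
    return count
```

For example, `write_training_csv("my_pairs.csv", [("ACDEFG", "KLMNPQ", 1.0)])` stores a pKd of 9.000 for a 1 nM binder. The file name and sequences here are placeholders.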

The training process includes:

Automatic hyperparameter optimization using Optuna
Early stopping and model checkpointing
Comprehensive logging and visualization
Caching of protein embeddings
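
The README does not state which stopping criterion index.py uses. As an illustration only, a common patience-based rule looks like the sketch below. `EarlyStopping`, the `patience` value, and the validation losses are all assumptions.

```python
from typing import Optional


class EarlyStopping:
    """Stop when validation loss has not improved for `patience` epochs in a row."""

    def __init__(self, patience: int = 3, min_delta: float = 0.0) -> None:
        self.patience = patience
        self.min_delta = min_delta
        self.best_loss: Optional[float] = None
        self.best_epoch: Optional[int] = None
        self.bad_epochs = 0

    def step(self, epoch: int, val_loss: float) -> bool:
        """Record one epoch; return True if training should stop."""
        if self.best_loss is None or val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.best_epoch = epoch  # a real loop would save a checkpoint here
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Made-up validation losses, one per epoch.
val_losses = [3.9, 3.1, 2.9, 3.0, 2.95, 3.2, 2.8]

stopper = EarlyStopping(patience=3)
stopped_at = None
for epoch, loss in enumerate(val_losses):
    if stopper.step(epoch, loss):
        stopped_at = epoch
        break

print(stopper.best_epoch, stopped_at)  # 2 5
```

Checkpointing at each improvement means the saved model corresponds to the best epoch (epoch 2 here), not to the epoch at which training stopped.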

APPT Prediction CLI

A command-line interface for predicting binding affinity between protein pairs.

Usage

You can provide input either through a CSV file containing protein sequences or by directly passing sequence pairs:

Using CSV input:

```
python cli.py \
    --input_csv data/protein_pairs.csv \
    --protein1_col protein1_sequence \
    --protein2_col protein2_sequence \
    --output_dir results \
    --model_path models/protein_protein_affinity_esm_vs_ankh_best.pt \
    --device cuda
```

Using raw sequences:

```
python cli.py \
    --sequences SEQUENCE1A SEQUENCE1B \
    --sequences SEQUENCE2A SEQUENCE2B \
    --output_dir results \
    --model_path models/protein_protein_affinity_esm_vs_ankh_best.pt \
    --device cpu
```

Arguments

Required (mutually exclusive):

--input_csv: Path to CSV file containing protein sequences
--sequences: Raw protein sequence pairs (can be specified multiple times for multiple pairs)

Optional:

--protein1_col: Column name for first protein sequences in CSV (default: 'protein1_sequence')
--protein2_col: Column name for second protein sequences in CSV (default: 'protein2_sequence')
--output_dir: Directory to save results (default: 'output')
--model_path: Path to trained model checkpoint (default: 'models/protein_protein_affinity_esm_vs_ankh_best.pt')
--data_path: Path to training data for normalization parameters (default: 'Data.csv')
--batch_size: Batch size for inference (default: 16)
--device: Device to run inference on (default: cuda if available, else cpu)

Model Configuration:

--input_dim: Input dimension of protein embeddings (default: 2560)
--embedding_dim: Embedding dimension for transformer (default: 384)
--linear_dim: Linear layer dimension (default: 160)
--num_attention_layers: Number of transformer attention layers (default: 2)
--num_heads: Number of attention heads (default: 4)
--dropout_rate: Dropout rate (default: 0.1)
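
To show how the "required, mutually exclusive" inputs and the repeatable `--sequences` flag behave, here is a sketch using Python's argparse. It is not the code in cli.py, only one plausible way to get the documented behavior. It covers a subset of the arguments, with defaults copied from the lists above.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        description="Predict protein-protein binding affinity (illustrative sketch)"
    )

    # Exactly one input source must be given.
    source = parser.add_mutually_exclusive_group(required=True)
    source.add_argument("--input_csv", help="CSV file containing protein sequences")
    source.add_argument(
        "--sequences",
        nargs=2,            # each use takes one pair of sequences
        action="append",    # repeated uses accumulate into a list of pairs
        metavar=("SEQ1", "SEQ2"),
        help="raw protein sequence pair; repeat the flag for more pairs",
    )

    parser.add_argument("--protein1_col", default="protein1_sequence")
    parser.add_argument("--protein2_col", default="protein2_sequence")
    parser.add_argument("--output_dir", default="output")
    parser.add_argument("--batch_size", type=int, default=16)
    return parser


args = build_parser().parse_args(
    ["--sequences", "SEQUENCE1A", "SEQUENCE1B",
     "--sequences", "SEQUENCE2A", "SEQUENCE2B"]
)
print(args.sequences)
# [['SEQUENCE1A', 'SEQUENCE1B'], ['SEQUENCE2A', 'SEQUENCE2B']]
```

With this setup, passing both `--input_csv` and `--sequences`, or neither, makes argparse exit with a usage error.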

Output

The script saves results to a CSV file in the specified output directory (default: 'output/binding_predictions.csv') with the following columns:

Protein1_Sequence: First protein sequence
Protein2_Sequence: Second protein sequence
Predicted_pKd: Predicted binding affinity (pKd value)

Example output CSV:

```
Protein1_Sequence,Protein2_Sequence,Predicted_pKd
SEQUENCE1A,SEQUENCE1B,7.24
SEQUENCE2A,SEQUENCE2B,6.85
```
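
A results file in this format can be post-processed with the standard library alone. The sketch below ranks pairs by predicted affinity and converts pKd back to a dissociation constant using Kd = 10^(-pKd) mol/L. `load_predictions` is a hypothetical helper, and the input is the example CSV shown above.

```python
import csv
import io
from typing import Dict, List, TextIO

EXAMPLE = """Protein1_Sequence,Protein2_Sequence,Predicted_pKd
SEQUENCE1A,SEQUENCE1B,7.24
SEQUENCE2A,SEQUENCE2B,6.85
"""


def load_predictions(handle: TextIO) -> List[Dict[str, object]]:
    """Parse the predictions CSV and sort by pKd, strongest binder first."""
    rows = []
    for row in csv.DictReader(handle):
        pkd = float(row["Predicted_pKd"])
        rows.append(
            {
                "protein1": row["Protein1_Sequence"],
                "protein2": row["Protein2_Sequence"],
                "pkd": pkd,
                "kd_molar": 10 ** (-pkd),  # higher pKd means a smaller Kd
            }
        )
    return sorted(rows, key=lambda r: r["pkd"], reverse=True)


ranked = load_predictions(io.StringIO(EXAMPLE))
for r in ranked:
    print(f"{r['protein1']} + {r['protein2']}: pKd={r['pkd']:.2f}, Kd={r['kd_molar']:.2e} M")
```

To read a real results file, pass an open file handle, for example `open("output/binding_predictions.csv", newline="")`, instead of the `io.StringIO` wrapper.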

Project Structure

index.py: Main training script and model implementation
cli.py: Command-line interface for predictions
output/: Directory containing:
- models/: Trained model checkpoints
- embedding_cache/: Cached protein embeddings
- hyperopt_results.json: Hyperparameter optimization results
- Figures: Various visualization plots

Output Files

The training process generates several output files:

Model checkpoints
Hyperparameter optimization results
Training history plots
Prediction vs. actual plots
Detailed logs

Contributing

We welcome contributions to improve APPT. Please feel free to submit issues and pull requests.

License

This project is distributed under the CC BY-NC-SA license. See LICENSE for details.
