A fast deep learning method for large-scale protein-protein interaction screening.
-
Clone the repository:
git clone https://github.com/CongLabCode/RoseTTAFold2-PPI.git
-
Download the weights to RoseTTAFold2-PPI/src/models:
cd RoseTTAFold2-PPI/src/models
wget --no-check-certificate https://conglab.swmed.edu/humanPPI/downloads/RF2-PPI.pt
-
Choose one of the following installation methods:
A. Download our singularity image (hhsuite is needed if you want to use our paired MSA generation script):
cd RoseTTAFold2-PPI/
wget --no-check-certificate https://conglab.swmed.edu/humanPPI/downloads/SE3nv-20230612.sif
conda install -c conda-forge -c bioconda hhsuite
B. Install a conda environment (if you cannot use Singularity):
conda create -n rf2ppi python=3.9
conda activate rf2ppi
conda install -c conda-forge -c bioconda hhsuite
pip install numpy==1.21.2
pip install pandas==1.5.3
pip install torch==1.12.1+cu113 -f https://download.pytorch.org/whl/torch_stable.html
pip install biopython==1.79
pip install scipy==1.7.1
pip install einops
-
using singularity:
singularity exec \
    --bind /path/to/input_and_output_directory:/work/users \
    --bind /path/to/rosettafold2-ppi/directory:/home/RoseTTAFold2-PPI \
    --nv SE3nv-20230612.sif \
    /bin/bash -c "cd /work/users && python /home/RoseTTAFold2-PPI/src/predict_list_PPI.py -list_fn input_file -model_file model_file"
-
using conda environment:
conda activate rf2ppi
python [/path/to]/RoseTTAFold2-PPI/src/predict_list_PPI.py -list_fn [input_file] -model_file [/path/to]/RoseTTAFold2-PPI/src/models/RF2-PPI.pt
For the [input_file], e.g., examples/input_file, each line should contain two columns:
- File path of the paired multiple sequence alignment (MSA) input.
- Length of the first protein.
Note: When using Singularity, paths should be relative to the directories mounted inside the container. If you prefer to use absolute paths, ensure they reference the file paths inside the container after mounting the directories.
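The two-column input file described above could be generated with a short script. This is a minimal sketch, not part of the repository: the function names, the `.a3m` file names, and the single-space column separator are assumptions, so check the files under examples/ for the exact delimiter before relying on it.

```python
def format_input_lines(pairs):
    """Turn (paired_msa_path, first_protein_length) tuples into list-file lines.

    Assumption: the two columns are separated by a single space.
    """
    lines = []
    for msa_path, first_len in pairs:
        if first_len <= 0:
            raise ValueError(f"length of first protein must be positive: {msa_path}")
        lines.append(f"{msa_path} {first_len}")
    return lines


def write_input_list(pairs, out_path):
    """Write one line per protein pair to out_path."""
    with open(out_path, "w") as handle:
        handle.write("\n".join(format_input_lines(pairs)) + "\n")


if __name__ == "__main__":
    # Hypothetical paths and lengths, for illustration only.
    write_input_list(
        [("msas/P1_P2.a3m", 120), ("msas/P3_P4.a3m", 87)],
        "my_input_file",
    )
```

When running under Singularity, the paths written here should be valid inside the container, as noted above.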
A simplified pipeline to generate paired MSAs is as follows:
- Search for homologs of each protein using tools like HHblits (be sure to turn on the "-all" flag in HHblits to output all the hits)
- Identify the closest hit (by sequence identity) to the query from each organism, and discard other hits
- Combine the resulting MSAs of both proteins by concatenating the two hit sequences (one for each query) from the same organism
- Discard sequences that cannot be paired
- Remove redundancy by 90% or 95% sequence identity using hhfilter.
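The pairing steps above (closest hit per organism, concatenation, discarding unpaired sequences) can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the logic of the repository's scripts: it assumes each hit has already been parsed into a hypothetical `(organism, aligned_sequence, identity_to_query)` tuple, and it leaves the homolog search and redundancy filtering to HHblits and hhfilter.

```python
def best_hit_per_organism(hits):
    """Keep only the hit with the highest sequence identity for each organism.

    hits: iterable of (organism, aligned_sequence, identity_to_query) tuples
    (a hypothetical pre-parsed representation of one protein's MSA).
    """
    best = {}
    for organism, aligned_seq, identity in hits:
        if organism not in best or identity > best[organism][1]:
            best[organism] = (aligned_seq, identity)
    return {organism: seq for organism, (seq, _) in best.items()}


def pair_msas(hits_a, hits_b):
    """Concatenate the best hits of two proteins organism by organism.

    Organisms present in only one of the two MSAs are dropped, so the
    result contains paired sequences only. The order follows hits_a,
    which keeps the query organism first if it is listed first there.
    """
    best_a = best_hit_per_organism(hits_a)
    best_b = best_hit_per_organism(hits_b)
    return [(org, best_a[org] + best_b[org]) for org in best_a if org in best_b]


if __name__ == "__main__":
    # Toy data, for illustration only.
    protein_a = [("human", "MKV", 1.0), ("mouse", "MKI", 0.9), ("mouse", "MRI", 0.5)]
    protein_b = [("human", "GS", 1.0), ("mouse", "GT", 0.8), ("yeast", "AT", 0.4)]
    for organism, paired in pair_msas(protein_a, protein_b):
        print(organism, paired)
```

The paired sequences would then be written out in the alignment format expected downstream and passed through hhfilter to remove redundancy.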
For most human proteins, you can generate paired MSAs using the omicMSAs for single proteins we shared at https://conglab.swmed.edu/humanPPI/humanPPI_download.html
We share omicMSAs both as entire proteins (protein_omicMSAs.tar.gz) and segments (segment_omicMSAs.tar.gz, breaking long proteins into shorter segments and excluding low-quality positions in the MSAs).
To generate paired MSAs from the omicMSAs we shared, please use the following commands:
python /path/to/RoseTTAFold2-PPI/generate_protein_pair_MSA.py [list_of_protein_pairs] [directory_with_single_protein_MSAs] [output_directory_with_paired_MSAs]
python /path/to/RoseTTAFold2-PPI/generate_segment_pair_MSA.py [list_of_segment_pairs] [directory_with_single_segment_MSAs] [output_directory_with_paired_MSAs]
Note: Performance is affected by the quality of the paired MSAs. Our benchmarks suggest that the following best practices can enhance the accuracy of RoseTTAFold2-PPI:
- Get deeper MSAs
- Remove low-quality regions, corresponding to poorly conserved intrinsically disordered regions
- Only include paired MSAs and remove any unpaired sequences
- Remove redundant sequences at 90% or 95% sequence identity after "pairing"
The output files will be saved as [input_file].npz and [input_file].log, e.g., those in examples/expected_output.
The log file contains three columns:
- The input MSA file name
- Predicted interaction probability for a protein/segment pair
- Compute time
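The three-column log can be post-processed to pull out high-scoring pairs. The sketch below is hypothetical: it assumes the columns are whitespace-separated and that the compute time is a plain number, and the 0.5 cutoff is an arbitrary illustration rather than a recommended threshold.

```python
def parse_log(text):
    """Parse log text into (msa_name, probability, compute_time) tuples.

    Assumptions: columns are whitespace-separated and columns 2 and 3 are
    numeric. Lines that do not fit (e.g. a header) are skipped.
    """
    records = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue
        try:
            probability = float(fields[1])
            compute_time = float(fields[2])
        except ValueError:
            continue
        records.append((fields[0], probability, compute_time))
    return records


def pairs_above(records, threshold=0.5):
    """Return the MSA names whose predicted probability is >= threshold."""
    return [name for name, probability, _ in records if probability >= threshold]


if __name__ == "__main__":
    # Made-up log content, for illustration only.
    example = "pairA.a3m 0.93 4.2\npairB.a3m 0.07 3.9\n"
    print(pairs_above(parse_log(example)))
```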
The npz file contains the inter-residue interaction probabilities. The input MSA file names are used as keys, each pointing to a NumPy matrix that contains the predicted interaction probability between a residue in the first protein and a residue in the second. This matrix has the shape (L1, L2), where L1 and L2 are the lengths of the two proteins.
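As a sketch of how the npz output might be inspected, the snippet below loads one (L1, L2) matrix with `numpy.load` and lists its highest-probability residue pairs. The helper names and the example key are hypothetical; residue indices are zero-based, following NumPy indexing.

```python
import numpy as np


def load_contact_map(npz_path, key):
    """Load the (L1, L2) probability matrix stored under one MSA file name."""
    with np.load(npz_path) as data:
        return data[key]


def top_contacts(probs, n=5):
    """Return the n highest-probability residue pairs as (i, j, probability).

    i indexes residues of the first protein, j residues of the second;
    both are zero-based.
    """
    order = np.argsort(probs, axis=None)[::-1][:n]
    rows, cols = np.unravel_index(order, probs.shape)
    return [(int(i), int(j), float(probs[i, j])) for i, j in zip(rows, cols)]


if __name__ == "__main__":
    # Synthetic matrix standing in for one entry of [input_file].npz.
    demo = np.zeros((3, 4))
    demo[1, 2] = 0.9
    demo[0, 3] = 0.7
    for i, j, p in top_contacts(demo, n=2):
        print(i, j, p)
    # With real output (hypothetical key name):
    # probs = load_contact_map("input_file.npz", "msas/P1_P2.a3m")
```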
Similar to AlphaFold2, the interaction probability predicted by RoseTTAFold2-PPI is not deterministic. Our benchmark suggests that the standard deviation in predicted interaction probabilities might exceed 0.1 (out of 1) for about 5% of cases (Fig. A below), and such variability is more obvious for pairs with intermediate interaction probabilities (Fig. B below).
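One possible way to gauge this variability for your own pairs is to repeat the prediction several times and compare the scores. This is a suggestion rather than a procedure prescribed by the authors; the sketch assumes each run's probabilities have already been collected into a dict keyed by MSA file name.

```python
from statistics import mean, stdev


def summarize_runs(runs):
    """Compute (mean, standard deviation) of the probability for each pair.

    runs: list of dicts mapping MSA file name -> predicted interaction
    probability, one dict per repeated run. Only pairs present in every
    run are summarized.
    """
    if not runs:
        return {}
    shared = set.intersection(*(set(run) for run in runs))
    summary = {}
    for name in sorted(shared):
        values = [run[name] for run in runs]
        spread = stdev(values) if len(values) > 1 else 0.0
        summary[name] = (mean(values), spread)
    return summary


def unstable_pairs(summary, max_std=0.1):
    """Return pairs whose standard deviation across runs exceeds max_std."""
    return [name for name, (_, spread) in summary.items() if spread > max_std]


if __name__ == "__main__":
    # Made-up probabilities from three hypothetical repeat runs.
    runs = [
        {"pairA.a3m": 0.92, "pairB.a3m": 0.35},
        {"pairA.a3m": 0.94, "pairB.a3m": 0.60},
        {"pairA.a3m": 0.93, "pairB.a3m": 0.48},
    ]
    summary = summarize_runs(runs)
    print(unstable_pairs(summary))
```

The default `max_std` of 0.1 mirrors the figure quoted above; pairs flagged this way may deserve extra scrutiny before being called interacting or non-interacting.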
-
using singularity:
cd RoseTTAFold2-PPI
exec_dir=$(pwd)
singularity exec \
    --bind $exec_dir:/home/RoseTTAFold2-PPI \
    --nv SE3nv-20230612.sif \
    /bin/bash -c "cd /home/RoseTTAFold2-PPI && python /home/RoseTTAFold2-PPI/src/predict_list_PPI.py -list_fn examples/protein_pairs_input -model_file src/models/RF2-PPI.pt"
-
using conda environment:
conda activate rf2ppi
cd RoseTTAFold2-PPI/examples
python ../src/predict_list_PPI.py -list_fn segment_pairs_input -model_file ../src/models/RF2-PPI.pt
These commands will generate outputs similar to those in examples/expected_output.
Jing Zhang*, Ian R Humphreys*, Jimin Pei*, Jinuk Kim, Chulwon Choi, Rongqing Yuan, Jesse Durham, Siqi Liu, Hee-Jung Choi, Minkyung Baek, David Baker, Qian Cong. Computing the Human Interactome. (https://www.biorxiv.org/content/10.1101/2024.10.01.615885v1)
