KrishnaswamyLab/ProtSCAPE-Net

ProtSCAPE-Net

ProtSCAPE-Net - Learning Protein Conformational Landscapes from Molecular Dynamics for Ensemble and Transition Path Generation

*Architecture schematic placeholder: add the diagram here.*


Overview

ProtSCAPE-Net combines multiple state-of-the-art techniques to learn and generate protein conformational landscapes:

  • SE(3)-Equivariant Graph Networks: Respect the symmetries of 3D protein structures
  • Scattering Transforms: Extract multi-scale geometric features
  • Transformer Encoders: Capture long-range dependencies between atoms/residues
  • Latent Diffusion Models: Generate novel conformational ensembles
  • Energy-Guided Path Generation: Nudged elastic band (NEB) for transition-pathway discovery

Key Features

✨ Structure Reconstruction: Atomic-level protein structure prediction from graph representations
🧬 Conformational Ensemble Generation: Sample diverse protein conformations via latent diffusion
πŸ›€οΈ Transition Path Discovery: Generate minimum energy paths between conformational states
πŸ“Š MolProbity Integration: Automated structure quality assessment
⚑ Efficient Training: PyTorch Lightning with mixed precision and distributed training support


Installation

Prerequisites

  • Python 3.8 or higher
  • CUDA-capable GPU (recommended)
  • Conda or virtualenv (recommended)

Setup

# Clone the repository
git clone https://github.com/yourusername/ProtSCAPE-Net.git
cd ProtSCAPE-Net

# Create a conda environment
conda create -n protscape python=3.8
conda activate protscape

# Install dependencies
pip install -r requirements.txt

# Install PyTorch Geometric (adjust CUDA version as needed)
pip install torch-geometric torch-scatter torch-sparse -f https://data.pyg.org/whl/torch-1.13.0+cu117.html
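After installing, a quick sanity check can confirm the core packages resolve. This is a generic sketch, not a script shipped with the repo; `check_env` is a hypothetical helper:

```python
# Sanity-check that the core dependencies are importable (hypothetical helper,
# not part of the repository).
import importlib.util

def check_env(modules=("torch", "torch_geometric")):
    """Return {module_name: installed?} without importing the heavy packages."""
    return {m: importlib.util.find_spec(m) is not None for m in modules}

if __name__ == "__main__":
    for name, ok in check_env().items():
        print(f"{name}: {'OK' if ok else 'MISSING'}")
```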

Optional Dependencies

For advanced visualization:

# Quote the version specifier so the shell does not treat ">" as a redirect
pip install "phate>=0.2.5"

For MolProbity metrics:

# Requires phenix.molprobity (install separately)
# See: https://www.phenix-online.org/

Quick Start

1. Train a Model

# Use a pre-configured setup
python train.py --config configs/config.yaml

# Or specify a protein
python train.py --config configs/config.yaml --protein 7lp1

2. Run Inference

# Evaluate on test data
python inference.py --config configs/config_inference.yaml --ckpt_path checkpoints/best_model.pt

3. Generate Conformational Ensembles

# Run the complete pipeline: AE training → DDPM training → generation
python ensemble_gen.py --config configs/config_ensemble.py

πŸ“ Project Structure

ProtSCAPE-Net/
├── protscape/              # Core model implementations
│   ├── protscape.py        # Main ProtSCAPE model
│   ├── autoencoder.py      # Variational autoencoder
│   ├── transformer.py      # Transformer encoder
│   ├── bottleneck.py       # Latent space bottleneck
│   ├── generate.py         # Path generation algorithms
│   ├── neb.py              # Nudged Elastic Band
│   └── wavelets.py         # Scattering transform layer
├── utils/                  # Utility functions
│   ├── generation_helpers.py
│   ├── generation_viz.py
│   ├── geometry.py         # Kabsch alignment, RMSD
│   └── config.py           # Configuration loading
├── configs/                # Configuration files
│   ├── config.yaml         # Training config
│   ├── config_inference.yaml
│   ├── config_ensemble.py
│   └── CONFIG_GUIDE.md     # Configuration documentation
├── data/                   # Data preparation scripts
│   ├── prepare_atlas.py
│   ├── prepare_deshaw.py
│   └── download_*.py
├── docs/                   # Documentation
│   └── PATH_GENERATION_METHODS.md
├── train.py                # Training script
├── inference.py            # Inference/evaluation script
├── ensemble_gen.py         # Ensemble generation pipeline
└── requirements.txt        # Python dependencies

Usage

Training

Train ProtSCAPE on protein conformational data:

python train.py --config configs/config.yaml

Key training parameters (in config.yaml):

  • dataset: Dataset name (e.g., "atlas", "deshaw")
  • protein: Protein ID (e.g., "7lp1", "1bx7")
  • pkl_path: Path to preprocessed graph data
  • latent_dim: Dimensionality of latent space (default: 128)
  • n_epochs: Number of training epochs
  • batch_size: Batch size
  • lr: Learning rate

Training outputs:

  • Checkpoints in checkpoints/
  • Training logs in train_logs/
  • Weights & Biases logging (if configured)

Inference

Evaluate a trained model:

python inference.py --config configs/config_inference.yaml --ckpt_path checkpoints/best_model.pt

Outputs:

  • pdb_frames/: Predicted and ground truth PDB files
  • latents_zrep.npy: Latent space representations
  • energies.npy: Energy values
  • pca_energy.png, phate_energy.png: Dimensionality reduction visualizations

Key metrics:

  • Kabsch-aligned RMSD (Å)
  • Coordinate MSE
  • Classification accuracy (atomic number, residue, amino acid)
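For reference, Kabsch-aligned RMSD can be computed in a few lines of NumPy. The repo implements this in `utils/geometry.py`, so the sketch below is illustrative rather than the project's exact code:

```python
import numpy as np

def kabsch_rmsd(P, Q):
    """RMSD (in the input units, e.g. Å) between two (N, 3) coordinate sets
    after optimal rigid-body superposition (Kabsch algorithm)."""
    # Remove translation by centering both structures on their centroids.
    P = P - P.mean(axis=0)
    Q = Q - Q.mean(axis=0)
    # Optimal rotation from the SVD of the 3x3 covariance matrix.
    U, _, Vt = np.linalg.svd(P.T @ Q)
    # Correct for a possible reflection (det = -1).
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    return np.sqrt(((P @ R.T - Q) ** 2).sum() / len(P))
```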

Ensemble Generation

Generate conformational ensembles using latent diffusion:

python ensemble_gen.py --config configs/config_ensemble.py

Pipeline stages:

  1. Autoencoder Training: Compress conformational space
  2. DDPM Training: Learn generative model in latent space
  3. Sampling: Generate novel conformations
  4. Evaluation: Compute MolProbity scores and structural metrics
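Stage 3 follows standard DDPM ancestral sampling. A minimal NumPy sketch, where `eps_model` stands in for the trained latent noise predictor and the linear noise schedule is an assumption, not the repo's config:

```python
import numpy as np

def ddpm_sample(eps_model, latent_dim, n_steps=50, rng=None):
    """Ancestral DDPM sampling in latent space (illustrative schedule)."""
    rng = rng if rng is not None else np.random.default_rng(0)
    betas = np.linspace(1e-4, 0.02, n_steps)     # linear noise schedule (assumed)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    z = rng.standard_normal(latent_dim)          # start from pure Gaussian noise
    for t in reversed(range(n_steps)):
        eps = eps_model(z, t)                    # predicted noise at step t
        # Posterior mean of z_{t-1} given z_t and the predicted noise.
        mean = (z - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        noise = rng.standard_normal(latent_dim) if t > 0 else 0.0
        z = mean + np.sqrt(betas[t]) * noise
    return z
```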

Path Generation

Generate transition paths between conformational states:

# LEP method (Langevin dynamics)
python ensemble_gen.py --config configs/config_generation.yaml --method LEP

# NEB method (Nudged Elastic Band)
python ensemble_gen.py --config configs/config_generation_neb.yaml --method NEB

See docs/PATH_GENERATION_METHODS.md for detailed comparison of methods.


Configuration

All parameters are managed via YAML configuration files. See configs/CONFIG_GUIDE.md for detailed documentation.

Example config.yaml:

# Dataset
dataset: "atlas"
protein: "7lp1"
pkl_path: "data/graphs/7lp1_graphs.pkl"

# Model architecture
latent_dim: 128
hidden_dim: 256
embedding_dim: 128
n_layers: 4
n_heads: 8

# Training
n_epochs: 1000
batch_size: 32
lr: 0.0001
weight_decay: 0.0001

# Normalization
normalize_xyz: true
normalize_energy: true

# Logging
wandb_project: "protscape"
save_dir: "checkpoints/"

Command-line overrides:

python train.py --config config.yaml --batch_size 64 --lr 0.0005
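One common way to implement such overrides is to build an `argparse` parser from the config's keys, so any key becomes a CLI flag. The sketch below is a guess at the mechanism, not the repo's actual `utils/config.py`:

```python
import argparse

def load_config_with_overrides(base_cfg, argv=None):
    """Expose every config key as a CLI flag; CLI values win over defaults.
    Caveat: boolean keys would need `store_true`-style handling; this sketch
    covers int/float/str values only."""
    parser = argparse.ArgumentParser()
    for key, default in base_cfg.items():
        # type(default) coerces the CLI string back to the config value's type.
        parser.add_argument(f"--{key}", type=type(default), default=default)
    return vars(parser.parse_args(argv))
```

For example, `load_config_with_overrides({"batch_size": 32, "lr": 0.0001}, ["--batch_size", "64"])` keeps `lr` at its default while replacing `batch_size`.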

Datasets

Supported Datasets

  1. ATLAS: High-quality MD simulations of folding transitions
  2. DE Shaw: Anton ultra-long MD simulations
  3. Custom: Your own molecular dynamics trajectories

Data Preparation

# Download and prepare ATLAS dataset
cd data/
python download_atlas.py
python prepare_atlas.py --protein 7lp1

# Prepare DE Shaw data
python download_deshaw.py
python prepare_deshaw.py --protein ubiquitin

Data format: Preprocessed graphs stored as .pkl files with:

  • x: Node features [atomic_number, residue_idx, aa_idx, xyz(3)]
  • edge_index: Graph connectivity
  • edge_attr: Edge features
  • energy: Potential energy (optional)
  • time: Simulation time (optional)
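As an illustration of this layout, one frame can be packed into these fields using a distance-cutoff radius graph; the cutoff value and the distance edge feature here are assumptions, not the repo's preprocessing:

```python
import numpy as np

def frame_to_graph(atomic_numbers, residue_idx, aa_idx, xyz, cutoff=4.5):
    """Pack one MD frame into the field layout above (cutoff in Å, assumed)."""
    n = len(atomic_numbers)
    # Node features: [atomic_number, residue_idx, aa_idx, x, y, z]
    x = np.column_stack([atomic_numbers, residue_idx, aa_idx, xyz])
    # Radius graph: a directed edge for every atom pair closer than `cutoff`.
    dist = np.linalg.norm(xyz[:, None, :] - xyz[None, :, :], axis=-1)
    src, dst = np.nonzero((dist < cutoff) & ~np.eye(n, dtype=bool))
    return {
        "x": x,
        "edge_index": np.stack([src, dst]),     # shape (2, n_edges)
        "edge_attr": dist[src, dst][:, None],   # edge feature: pairwise distance
    }
```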

Path Generation Methods

LEP (Low Energy Path)

Stochastic trajectory generation using Langevin dynamics with momentum in latent space.

Pros: Explores multiple pathways, handles conformational heterogeneity
Cons: Stochastic, may not find true minimum energy path

method: "LEP"
steps: 1000
step_size: 1e-10
momentum: 0.9
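A minimal sketch of such a momentum Langevin update in latent space, with `energy_grad` standing in for the gradient of the learned energy surface (the function name and noise scale are assumptions, not the repo's `protscape/generate.py`):

```python
import numpy as np

def lep_path(energy_grad, z_start, steps=1000, step_size=1e-10,
             momentum=0.9, noise_scale=1e-5, seed=0):
    """Langevin-style descent with momentum in latent space."""
    rng = np.random.default_rng(seed)
    z = z_start.copy()
    v = np.zeros_like(z)
    path = [z.copy()]
    for _ in range(steps):
        # Velocity mixes momentum, the energy gradient, and thermal noise.
        v = (momentum * v - step_size * energy_grad(z)
             + noise_scale * rng.standard_normal(z.shape))
        z = z + v
        path.append(z.copy())
    return np.array(path)            # shape (steps + 1, latent_dim)
```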

NEB (Nudged Elastic Band)

Deterministic optimization to find minimum energy pathways.

Pros: Converges to a minimum energy path, identifies transition states
Cons: Yields a single path per run, computationally intensive

method: "NEB"
n_pivots: 20
neb_steps: 50
neb_lr: 0.05
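The core NEB update decomposes the force on each interior image into the true force perpendicular to the band plus a spring force along it. A minimal sketch on an arbitrary differentiable energy (illustrative, not the repo's `protscape/neb.py`):

```python
import numpy as np

def neb(grad, start, end, n_pivots=20, n_steps=50, lr=0.05, k=1.0):
    """Relax a band of images between two states toward a minimum energy path."""
    # Initialize the band by linear interpolation between the endpoints.
    path = np.linspace(start, end, n_pivots)
    for _ in range(n_steps):
        for i in range(1, n_pivots - 1):        # endpoints stay fixed
            # Unit tangent along the band at image i.
            tau = path[i + 1] - path[i - 1]
            tau = tau / (np.linalg.norm(tau) + 1e-12)
            g = grad(path[i])
            # True force, projected perpendicular to the band ...
            f_perp = -(g - np.dot(g, tau) * tau)
            # ... plus a spring force along it that keeps images evenly spaced.
            f_spring = k * (np.linalg.norm(path[i + 1] - path[i])
                            - np.linalg.norm(path[i] - path[i - 1])) * tau
            path[i] = path[i] + lr * (f_perp + f_spring)
    return path
```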

Model Architecture

ProtSCAPE combines several key components:

  1. EGNN Layers: SE(3)-equivariant message passing preserves geometric structure
  2. Scattering Transform: Multi-scale wavelet-based feature extraction
  3. Transformer Encoder: Self-attention over atomic features
  4. Bottleneck Module: Compresses to low-dimensional latent space
  5. Multi-Task Decoder: Predicts atomic features and 3D coordinates

Loss Functions:

  • Cross-entropy for discrete features (atomic number, residue, amino acid)
  • Kabsch-aligned MSE for 3D coordinates (Procrustes distance)
  • Optional energy prediction loss

Evaluation Metrics

Structure Quality

  • Kabsch RMSD: Rotation-invariant coordinate accuracy
  • MolProbity Score: Overall structure quality
  • Clashscore: Steric clash detection
  • Ramachandran: Backbone dihedral angle validation

Latent Space

  • PCA/PHATE: Visualization of learned manifold
  • Energy Correlation: Latent space energy landscape fidelity
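Both checks are straightforward given the saved `latents_zrep.npy` and `energies.npy`. A NumPy-only sketch (PCA via SVD, Pearson correlation); the function name is illustrative:

```python
import numpy as np

def latent_diagnostics(latents, energies):
    """PCA projection of latents plus PC1-vs-energy Pearson correlation."""
    # PCA via SVD of the mean-centered latent matrix.
    Z = latents - latents.mean(axis=0)
    _, _, Vt = np.linalg.svd(Z, full_matrices=False)
    pcs = Z @ Vt[:2].T                           # first two principal components
    # How well the leading latent direction tracks potential energy.
    r = np.corrcoef(pcs[:, 0], energies)[0, 1]
    return pcs, r
```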

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.


Contact

For questions or issues, please open an issue on this repository.

