# ProtSCAPE-Net: Learning Protein Conformational Landscapes from Molecular Dynamics for Ensemble and Transition Path Generation
<!-- Architecture schematic placeholder - add your diagram here -->
## Overview

ProtSCAPE-Net combines multiple state-of-the-art techniques to learn and generate protein conformational landscapes:
- SE(3)-Equivariant Graph Networks: Respects the symmetries of 3D protein structures
- Scattering Transforms: Multi-scale geometric feature extraction
- Transformer Encoders: Captures long-range dependencies between atoms/residues
- Latent Diffusion Models: Generates novel conformational ensembles
- Energy-Guided Path Generation: NEB for transition pathway discovery
## Features

- ✨ **Structure Reconstruction**: Atomic-level protein structure prediction from graph representations
- 🧬 **Conformational Ensemble Generation**: Sample diverse protein conformations via latent diffusion
- 🛤️ **Transition Path Discovery**: Generate minimum-energy paths between conformational states
- 🔍 **MolProbity Integration**: Automated structure quality assessment
- ⚡ **Efficient Training**: PyTorch Lightning with mixed precision and distributed training support
## Installation

Requirements:

- Python 3.8 or higher
- CUDA-capable GPU (recommended)
- Conda or virtualenv (recommended)
```bash
# Clone the repository
git clone https://github.com/yourusername/ProtSCAPE-Net.git
cd ProtSCAPE-Net

# Create a conda environment
conda create -n protscape python=3.8
conda activate protscape

# Install dependencies
pip install -r requirements.txt

# Install PyTorch Geometric (adjust CUDA version as needed)
pip install torch-geometric torch-scatter torch-sparse -f https://data.pyg.org/whl/torch-1.13.0+cu117.html
```

For advanced visualization:

```bash
pip install "phate>=0.2.5"
```

For MolProbity metrics:

```bash
# Requires phenix.molprobity (install separately)
# See: https://www.phenix-online.org/
```

## Quick Start

Use a pre-configured setup:
```bash
python train.py --config configs/config.yaml

# Or specify a protein
python train.py --config configs/config.yaml --protein 7lp1
```

Evaluate on test data:

```bash
python inference.py --config configs/config_inference.yaml --ckpt_path checkpoints/best_model.pt
```

Run the complete pipeline (AE training → DDPM training → generation):
```bash
python ensemble_gen.py --config configs/config_ensemble.py
```

## Project Structure

```
ProtSCAPE-Net/
├── protscape/                    # Core model implementations
│   ├── protscape.py              # Main ProtSCAPE model
│   ├── autoencoder.py            # Variational autoencoder
│   ├── transformer.py            # Transformer encoder
│   ├── bottleneck.py             # Latent space bottleneck
│   ├── generate.py               # Path generation algorithms
│   ├── neb.py                    # Nudged Elastic Band
│   └── wavelets.py               # Scattering transform layer
├── utils/                        # Utility functions
│   ├── generation_helpers.py
│   ├── generation_viz.py
│   ├── geometry.py               # Kabsch alignment, RMSD
│   └── config.py                 # Configuration loading
├── configs/                      # Configuration files
│   ├── config.yaml               # Training config
│   ├── config_inference.yaml
│   ├── config_ensemble.py
│   └── CONFIG_GUIDE.md           # Configuration documentation
├── data/                         # Data preparation scripts
│   ├── prepare_atlas.py
│   ├── prepare_deshaw.py
│   └── download_*.py
├── docs/                         # Documentation
│   └── PATH_GENERATION_METHODS.md
├── train.py                      # Training script
├── inference.py                  # Inference/evaluation script
├── ensemble_gen.py               # Ensemble generation pipeline
└── requirements.txt              # Python dependencies
```
## Training

Train ProtSCAPE on protein conformational data:

```bash
python train.py --config configs/config.yaml
```

Key training parameters (in `config.yaml`):

- `dataset`: Dataset name (e.g., "atlas", "deshaw")
- `protein`: Protein ID (e.g., "7lp1", "1bx7")
- `pkl_path`: Path to preprocessed graph data
- `latent_dim`: Dimensionality of the latent space (default: 128)
- `n_epochs`: Number of training epochs
- `batch_size`: Batch size
- `lr`: Learning rate

Training outputs:

- Checkpoints in `checkpoints/`
- Training logs in `train_logs/`
- Weights & Biases logging (if configured)
## Inference

Evaluate a trained model:

```bash
python inference.py --config configs/config_inference.yaml --ckpt_path checkpoints/best_model.pt
```

Outputs:

- `pdb_frames/`: Predicted and ground-truth PDB files
- `latents_zrep.npy`: Latent space representations
- `energies.npy`: Energy values
- `pca_energy.png`, `phate_energy.png`: Dimensionality-reduction visualizations

Key metrics:

- Kabsch-aligned RMSD (Å)
- Coordinate MSE
- Classification accuracy (atomic number, residue, amino acid)
## Ensemble Generation

Generate conformational ensembles using latent diffusion:

```bash
python ensemble_gen.py --config configs/config_ensemble.py
```

Pipeline stages:

1. **Autoencoder training**: Compress the conformational space
2. **DDPM training**: Learn a generative model in the latent space
3. **Sampling**: Generate novel conformations
4. **Evaluation**: Compute MolProbity scores and structural metrics
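The sampling stage follows the standard DDPM ancestral-sampling loop. Below is a minimal numpy sketch of that loop in latent space; `dummy_eps` is an illustrative stand-in for the trained noise-prediction network, not the project's model, and the schedule values are assumptions:

```python
import numpy as np

def ddpm_sample(eps_model, latent_dim, T=100, seed=0):
    """Ancestral DDPM sampling: start from Gaussian noise and denoise step by step."""
    rng = np.random.default_rng(seed)
    betas = np.linspace(1e-4, 0.02, T)      # linear noise schedule (assumed)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)

    z = rng.standard_normal(latent_dim)     # z_T ~ N(0, I)
    for t in reversed(range(T)):
        eps = eps_model(z, t)               # predicted noise at step t
        coef = betas[t] / np.sqrt(1.0 - alpha_bars[t])
        mean = (z - coef * eps) / np.sqrt(alphas[t])
        noise = rng.standard_normal(latent_dim) if t > 0 else 0.0
        z = mean + np.sqrt(betas[t]) * noise  # z_{t-1}
    return z

# Stand-in noise predictor; a real run would use the trained DDPM.
dummy_eps = lambda z, t: 0.1 * z
sample = ddpm_sample(dummy_eps, latent_dim=128)
print(sample.shape)  # (128,)
```

The sampled latent would then be passed through the autoencoder's decoder to obtain 3D coordinates.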
## Transition Path Generation

Generate transition paths between conformational states:

```bash
# LEP method (Langevin dynamics)
python ensemble_gen.py --config configs/config_generation.yaml --method LEP

# NEB method (Nudged Elastic Band)
python ensemble_gen.py --config configs/config_generation_neb.yaml --method NEB
```

See `docs/PATH_GENERATION_METHODS.md` for a detailed comparison of the two methods.
## Configuration

All parameters are managed via YAML configuration files. See `configs/CONFIG_GUIDE.md` for detailed documentation.

Example `config.yaml`:
```yaml
# Dataset
dataset: "atlas"
protein: "7lp1"
pkl_path: "data/graphs/7lp1_graphs.pkl"

# Model architecture
latent_dim: 128
hidden_dim: 256
embedding_dim: 128
n_layers: 4
n_heads: 8

# Training
n_epochs: 1000
batch_size: 32
lr: 0.0001
weight_decay: 0.0001

# Normalization
normalize_xyz: true
normalize_energy: true

# Logging
wandb_project: "protscape"
save_dir: "checkpoints/"
```

Command-line overrides:

```bash
python train.py --config config.yaml --batch_size 64 --lr 0.0005
```

## Datasets

- **ATLAS**: High-quality MD simulations of folding transitions
- **DE Shaw**: Anton ultra-long MD simulations
- **Custom**: Your own molecular dynamics trajectories
```bash
# Download and prepare the ATLAS dataset
cd data/
python download_atlas.py
python prepare_atlas.py --protein 7lp1

# Prepare DE Shaw data
python download_deshaw.py
python prepare_deshaw.py --protein ubiquitin
```

Data format: preprocessed graphs stored as `.pkl` files with:

- `x`: Node features `[atomic_number, residue_idx, aa_idx, xyz(3)]`
- `edge_index`: Graph connectivity
- `edge_attr`: Edge features
- `energy`: Potential energy (optional)
- `time`: Simulation time (optional)
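A quick way to sanity-check this format: the toy frame below mirrors the fields listed above (shapes and the list-of-frames layout are illustrative assumptions, not the project's exact serialization):

```python
import os
import pickle
import tempfile
import numpy as np

# Toy frame following the field list above; a real file holds many frames.
frame = {
    "x": np.zeros((5, 6)),          # [atomic_number, residue_idx, aa_idx, x, y, z]
    "edge_index": np.array([[0, 1, 2, 3], [1, 2, 3, 4]]),  # COO connectivity
    "edge_attr": np.ones((4, 1)),   # edge features
    "energy": -123.4,               # optional potential energy
    "time": 0.0,                    # optional simulation time
}

path = os.path.join(tempfile.mkdtemp(), "toy_graphs.pkl")
with open(path, "wb") as f:
    pickle.dump([frame], f)         # assumed: a trajectory is a list of frames

with open(path, "rb") as f:
    frames = pickle.load(f)
print(len(frames), frames[0]["x"].shape)  # 1 (5, 6)
```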
## Path Generation Methods

### LEP (Langevin dynamics)

Stochastic trajectory generation using Langevin dynamics with momentum in the latent space.

- **Pros**: explores multiple pathways; handles conformational heterogeneity
- **Cons**: stochastic; may not find the true minimum energy path
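The update rule can be sketched as follows. Here `energy_grad` is a hypothetical differentiable energy surrogate; the toy quadratic stands in for the learned latent energy model, and the hyperparameters are illustrative:

```python
import numpy as np

def lep_trajectory(z0, energy_grad, steps=1000, step_size=1e-2,
                   momentum=0.9, noise_scale=1e-3, seed=0):
    """Langevin dynamics with momentum in latent space: drift down the
    energy gradient, keep a velocity term, inject Gaussian noise to explore."""
    rng = np.random.default_rng(seed)
    z = np.array(z0, dtype=float)
    v = np.zeros_like(z)
    path = [z.copy()]
    for _ in range(steps):
        v = momentum * v - step_size * energy_grad(z) \
            + noise_scale * rng.standard_normal(z.shape)
        z = z + v
        path.append(z.copy())
    return np.stack(path)

# Toy quadratic energy E(z) = ||z||^2 / 2, so grad E(z) = z.
path = lep_trajectory(np.ones(8), energy_grad=lambda z: z, steps=200)
print(path.shape)  # (201, 8)
```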
```yaml
method: "LEP"
steps: 1000
step_size: 1e-10
momentum: 0.9
```

### NEB (Nudged Elastic Band)

Deterministic optimization to find minimum energy pathways.
- **Pros**: finds the true MEP; identifies transition states
- **Cons**: deterministic; computationally intensive
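The core NEB iteration, sketched on a toy 2D double-well surface (not the project's latent energy model): each interior image feels the energy-gradient component perpendicular to the path plus a spring force along the tangent, while the endpoints stay fixed.

```python
import numpy as np

def neb(start, end, grad, n_pivots=20, steps=200, lr=0.05, k=1.0):
    """Nudged Elastic Band: relax interior images with the perpendicular
    energy gradient plus a tangential spring force; endpoints are fixed."""
    band = np.linspace(start, end, n_pivots)   # linear interpolation init
    for _ in range(steps):
        for i in range(1, n_pivots - 1):
            tau = band[i + 1] - band[i - 1]
            tau = tau / (np.linalg.norm(tau) + 1e-12)   # path tangent
            g = grad(band[i])
            g_perp = g - np.dot(g, tau) * tau           # perpendicular component
            spring = k * (np.linalg.norm(band[i + 1] - band[i])
                          - np.linalg.norm(band[i] - band[i - 1])) * tau
            band[i] = band[i] - lr * g_perp + lr * spring
    return band

# Toy double-well energy E(x, y) = (x^2 - 1)^2 + y^2 with minima at (+-1, 0).
grad = lambda p: np.array([4 * p[0] * (p[0] ** 2 - 1), 2 * p[1]])
path = neb(np.array([-1.0, 0.0]), np.array([1.0, 0.0]), grad)
print(path.shape)  # (20, 2)
```

The saddle point of this toy surface sits at the origin, which the converged band passes through.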
```yaml
method: "NEB"
n_pivots: 20
neb_steps: 50
neb_lr: 0.05
```

## Architecture

ProtSCAPE combines several key components:
- EGNN Layers: SE(3)-equivariant message passing preserves geometric structure
- Scattering Transform: Multi-scale wavelet-based feature extraction
- Transformer Encoder: Self-attention over atomic features
- Bottleneck Module: Compresses to low-dimensional latent space
- Multi-Task Decoder: Predicts atomic features and 3D coordinates
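The internals of the scattering layer in `wavelets.py` are not shown here, but a common formulation of geometric scattering uses dyadic diffusion wavelets Ψ_j = P^(2^(j-1)) - P^(2^j) built from a lazy random walk P. A numpy sketch under that assumption:

```python
import numpy as np

def scattering_features(A, x, max_scale=3):
    """First-order geometric scattering: apply dyadic diffusion wavelets
    to a node signal and aggregate the absolute responses per scale."""
    d = A.sum(axis=1)
    P = 0.5 * (np.eye(len(A)) + A / d[:, None])   # lazy random-walk matrix
    power = np.linalg.matrix_power
    feats = []
    for j in range(1, max_scale + 1):
        psi = power(P, 2 ** (j - 1)) - power(P, 2 ** j)  # wavelet at scale 2^j
        feats.append(np.abs(psi @ x).sum())              # aggregated moment
    return np.array(feats)

# Toy 4-node path graph with a one-hot node signal.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
coeffs = scattering_features(A, np.array([1.0, 0.0, 0.0, 0.0]))
print(coeffs.shape)  # (3,)
```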
Loss Functions:
- Cross-entropy for discrete features (atomic number, residue, amino acid)
- Kabsch-aligned MSE for 3D coordinates (Procrustes distance)
- Optional energy prediction loss
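The Kabsch-aligned coordinate loss and the RMSD metric share the same alignment step: center both point sets, find the optimal rotation via SVD, then compare coordinates. A self-contained numpy sketch:

```python
import numpy as np

def kabsch_rmsd(P, Q):
    """RMSD after optimal rigid alignment (Kabsch algorithm)."""
    P = P - P.mean(axis=0)                 # center both point sets
    Q = Q - Q.mean(axis=0)
    U, _, Vt = np.linalg.svd(P.T @ Q)
    d = np.sign(np.linalg.det(U @ Vt))     # guard against reflections
    R = U @ np.diag([1.0, 1.0, d]) @ Vt    # optimal rotation
    return float(np.sqrt(np.mean(np.sum((P @ R - Q) ** 2, axis=1))))

# A rigidly rotated copy of a structure should give ~0 RMSD.
rng = np.random.default_rng(0)
X = rng.standard_normal((10, 3))
theta = 0.7
Rz = np.array([[np.cos(theta), -np.sin(theta), 0.0],
               [np.sin(theta),  np.cos(theta), 0.0],
               [0.0, 0.0, 1.0]])
print(round(kabsch_rmsd(X @ Rz.T, X), 6))  # 0.0
```

Squaring the aligned residuals instead of taking the root gives the Kabsch-aligned MSE (Procrustes distance) used as the coordinate loss.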
## Evaluation Metrics

- **Kabsch RMSD**: Rotation-invariant coordinate accuracy
- **MolProbity Score**: Overall structure quality
- **Clashscore**: Steric clash detection
- **Ramachandran**: Backbone dihedral-angle validation
- **PCA/PHATE**: Visualization of the learned manifold
- **Energy Correlation**: Latent-space energy-landscape fidelity
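The PCA projection and the energy-correlation check can be reproduced with plain numpy. A sketch on toy latents and energies (the data here is synthetic; real runs would use `latents_zrep.npy` and `energies.npy` from inference):

```python
import numpy as np

def pca_project(Z, k=2):
    """Project latent vectors onto their top-k principal components via SVD."""
    Zc = Z - Z.mean(axis=0)
    _, _, Vt = np.linalg.svd(Zc, full_matrices=False)
    return Zc @ Vt[:k].T

def energy_correlation(pred, true):
    """Pearson correlation between predicted and reference energies."""
    return float(np.corrcoef(pred, true)[0, 1])

rng = np.random.default_rng(0)
Z = rng.standard_normal((100, 128))            # toy latent representations
E = Z[:, 0] + 0.1 * rng.standard_normal(100)   # toy energies tied to one axis
proj = pca_project(Z)
print(proj.shape, round(energy_correlation(E, Z[:, 0]), 2))
```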
## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

For questions or issues, please:

- Open an issue on GitHub
- Contact: [email protected]