# GraphBioisostere

A Graph Neural Network-based model for predicting bioisosteric replacements in drug discovery. This repository contains the implementation of a deep learning approach that predicts whether molecular transformations will maintain or improve biological activity across different protein targets.

## Overview

Bioisosteric replacement is a crucial strategy in medicinal chemistry for optimizing drug candidates. This project implements a graph neural network that learns to predict the viability of matched molecular pair (MMP) transformations from molecular graph representations.

## Features
- Graph-based molecular representation: Uses PyTorch Geometric for efficient graph neural network operations
- Target-aware learning: Handles multiple protein targets with shared molecular representations
- Multiple prediction modes: Supports whole-molecule and fragment-based predictions
- Transfer learning: Fine-tuning capabilities for target-specific predictions
- Baseline comparisons: Includes GBDT (Gradient Boosting Decision Trees) baseline using ECFP4 fingerprints
## Repository Structure

```
GraphBioisostere/
├── pro_GNN/                      # Main GNN implementation
│   ├── config.py                 # Model configuration
│   ├── training_cls_ddp_3.py     # Main training script (DDP support)
│   ├── finetune_reg.py           # Transfer learning script
│   ├── prep_cls_all.py           # Data preparation for whole molecules
│   ├── prep_cls_frag_all.py      # Data preparation for fragments
│   ├── encoder/                  # GNN encoder implementations
│   ├── model/                    # Model architectures
│   ├── utils/                    # Utility functions
│   └── local/                    # Local execution scripts
├── gbdt/                         # GBDT baseline implementation
│   ├── prep_ecfp4.py             # ECFP4 fingerprint generation
│   ├── prep_ecfp4_frag.py        # Fragment ECFP4 generation
│   └── run_gbdt.py               # GBDT training and evaluation
├── splitting/                    # Data splitting utilities
│   └── data_splitting.py         # Target-based 5-fold CV splitting
├── notebooks/                    # Jupyter notebooks for analysis
│   ├── data.ipynb                # Data exploration and selection
│   └── data_process.ipynb        # Data preprocessing
├── test/                         # Testing and evaluation scripts
│   └── generate_final_figures.py # Figure generation for results
└── README.md                     # This file
```
## Requirements

- Python 3.8+
- CUDA-capable GPU (recommended)
- PyTorch 2.0+ (the install commands below target torch 2.0 with CUDA 11.8)
- PyTorch Geometric
- RDKit
- LightGBM (for the baseline)
## Installation

1. Clone the repository:

```bash
git clone https://github.com/yourusername/GraphBioisostere.git
cd GraphBioisostere
```

2. Create a virtual environment:

```bash
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
```

3. Install dependencies:

```bash
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
pip install torch-geometric torch-scatter torch-sparse -f https://data.pyg.org/whl/torch-2.0.0+cu118.html
pip install rdkit pandas numpy scikit-learn lightgbm matplotlib seaborn tqdm
```

## Dataset

The dataset should be downloaded separately and placed in a directory (e.g., `MMP_dataset/`). It should contain:

- `dataset_consistentsmiles.csv`: Main CSV file with MMP pairs and labels

The CSV should include the columns `smiles1`, `smiles2`, `label`, `tid` (target ID), etc. The expected format:

```
index,smiles1,smiles2,frag1,frag2,tid,label,delta_value
0,CC(C)Cc1ccc(cc1)C(C)C(O)=O,CC(C)Cc1ccc(cc1)C(C)C(=O)NO,C(C)C(O)=O,C(C)C(=O)NO,1,1,2.5
```
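A quick sanity check that a downloaded CSV matches this schema can be done with the standard library (the required column set here is an assumption based on the sample row above):

```python
import csv
import io

# Sample row in the expected format, copied from this README.
sample = (
    "index,smiles1,smiles2,frag1,frag2,tid,label,delta_value\n"
    "0,CC(C)Cc1ccc(cc1)C(C)C(O)=O,CC(C)Cc1ccc(cc1)C(C)C(=O)NO,"
    "C(C)C(O)=O,C(C)C(=O)NO,1,1,2.5\n"
)

# Assumed minimal column set; adjust if the real file differs.
REQUIRED = {"smiles1", "smiles2", "frag1", "frag2", "tid", "label"}

def validate_mmp_csv(fh):
    """Check that an MMP CSV has the required columns; return parsed rows."""
    reader = csv.DictReader(fh)
    missing = REQUIRED - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    return list(reader)

rows = validate_mmp_csv(io.StringIO(sample))
print(len(rows), rows[0]["tid"], rows[0]["label"])
```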
## Data Preparation

Convert the CSV to PyTorch Geometric data format:

```bash
cd pro_GNN
python prep_cls_all.py /path/to/MMP_dataset/
```

This generates `dataset_consistentsmiles.pt` containing graph representations of all molecule pairs.
## Data Splitting

Create 5-fold cross-validation splits based on target IDs:

```bash
cd splitting
python data_splitting.py \
    --csv_path /path/to/MMP_dataset/dataset_consistentsmiles.csv \
    --output_dir /path/to/pro_GNN/dataset/dataset_cv \
    --data_path /path/to/MMP_dataset/dataset_consistentsmiles.pt \
    --pkl_output tid_5cv.pkl \
    --seed 41
```

This creates:

- `dataset_cv1.pt`, `dataset_cv2.pt`, ..., `dataset_cv5.pt`: train/val/test splits for each fold
- `tid_5cv.pkl`: metadata for the splits
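Target-based splitting means every pair belonging to a given `tid` lands in exactly one fold, so test-set targets are unseen during training. The idea can be sketched with scikit-learn's `GroupKFold` (the actual `data_splitting.py` may differ in details):

```python
import numpy as np
from sklearn.model_selection import GroupKFold

rng = np.random.default_rng(41)
tid = rng.integers(0, 20, size=200)  # fake target IDs, one per MMP pair
X = np.zeros((len(tid), 1))          # placeholder features

fold_sizes = []
for train_idx, test_idx in GroupKFold(n_splits=5).split(X, groups=tid):
    # A target ID never appears on both sides of the split.
    assert not set(tid[train_idx]) & set(tid[test_idx])
    fold_sizes.append((len(train_idx), len(test_idx)))
print(fold_sizes)
```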
## Training

### GNN Model

Train the graph neural network with distributed data parallel (DDP):

```bash
cd pro_GNN/local
bash run_ddp.sh 'dataset/dataset_cv1.pt' 'results/cv1/pair_diff' 'pair' 'diff'
```

Parameters:

- `dataset/dataset_cv1.pt`: input dataset file
- `results/cv1/pair_diff`: output directory
- `pair`: prediction mode (`pair`, `frag`, etc.)
- `diff`: loss type (`diff`, `cat`, `product`)
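The loss-type argument presumably controls how the two molecule embeddings are combined before the classification head. A hedged NumPy sketch of the three modes (the names suggest these interpretations, but the exact formulation in the code may differ):

```python
import numpy as np

def combine(h1, h2, mode):
    """Combine two per-molecule embeddings for pair classification.
    Assumed meanings: 'diff' = elementwise difference, 'cat' = concatenation,
    'product' = elementwise product."""
    if mode == "diff":
        return h2 - h1
    if mode == "cat":
        return np.concatenate([h1, h2], axis=-1)
    if mode == "product":
        return h1 * h2
    raise ValueError(f"unknown mode: {mode}")

h1, h2 = np.random.rand(64), np.random.rand(64)  # embeddings of the two molecules
print(combine(h1, h2, "diff").shape)     # (64,)
print(combine(h1, h2, "cat").shape)      # (128,)
print(combine(h1, h2, "product").shape)  # (64,)
```

Note that `diff` and `cat` are order-sensitive, which matters because MMP transformations are directional (molecule 1 is transformed into molecule 2).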
To train all 5 folds:

```bash
for i in {1..5}; do
    bash run_ddp.sh "dataset/dataset_cv${i}.pt" "results/cv${i}/pair_diff" 'pair' 'diff'
done
```

### Fragment-Based Predictions

For fragment-only predictions, prepare the fragment data, then split and train as above:

```bash
python prep_cls_frag_all.py /path/to/MMP_dataset/
```

### GBDT Baseline

Train the LightGBM baseline with ECFP4 fingerprints:
```bash
cd gbdt
# Generate fingerprints
python prep_ecfp4.py /path/to/MMP_dataset/ --radius 2 --n_bits 2048
```
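`prep_ecfp4.py` presumably computes Morgan fingerprints with RDKit; ECFP4 corresponds to radius 2 (diameter 4). A minimal sketch with the flags above, including one plausible way to pair the two fingerprints (the script's actual pairing scheme may differ):

```python
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem

def ecfp4(smiles, n_bits=2048):
    """ECFP4 = Morgan fingerprint with radius 2, folded to n_bits bits."""
    mol = Chem.MolFromSmiles(smiles)
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius=2, nBits=n_bits)
    return np.array(fp, dtype=np.uint8)

# SMILES taken from the sample CSV row in this README.
fp1 = ecfp4("CC(C)Cc1ccc(cc1)C(C)C(O)=O")
fp2 = ecfp4("CC(C)Cc1ccc(cc1)C(C)C(=O)NO")

# One common scheme: concatenate the two fingerprints into one feature vector.
pair_features = np.concatenate([fp1, fp2])
print(pair_features.shape)  # (4096,)
```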
```bash
# Train GBDT
python run_gbdt.py
```

## Evaluation

Test predictions are automatically saved during training:
- `results/cv*/*/test_predictions.npz`: contains predictions, true labels, and indices

To generate evaluation metrics and figures:

```bash
cd test
python generate_final_figures.py
```

## Transfer Learning

Fine-tune the pre-trained model on a specific target:
```bash
cd pro_GNN
python finetune_reg.py \
    --pretrained_model results/cv1/pair_diff/best_model.pth \
    --target_data notebooks/target/target_data.pt \
    --output_dir results/target/finetune
```

## Configuration

Key hyperparameters in `pro_GNN/config.py`:
```python
class Args:
    batch_size = 8192     # Batch size
    hidden_dim = 64       # Hidden dimension
    embedding_dim = 64    # Embedding dimension
    num_layers = 2        # Number of GNN layers
    dropout = 0.2         # Dropout rate
    lr = 1e-3             # Learning rate
    epochs = 500          # Maximum epochs
    patience = 30         # Early stopping patience
```

## Results

The trained models produce:
- Classification metrics: Accuracy, Precision, Recall, F1, ROC-AUC, MCC
- Prediction files: NPZ files with predictions for analysis
- Training logs: Loss curves and validation metrics
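The metrics can be recomputed from a saved NPZ file with scikit-learn. The array key names below are hypothetical (inspect `np.load(...).files` for the real ones); the demo uses stand-in data:

```python
import numpy as np
from sklearn.metrics import (accuracy_score, f1_score,
                             matthews_corrcoef, roc_auc_score)

# Real usage (key names assumed):
# npz = np.load("results/cv1/pair_diff/test_predictions.npz")
# y_true, y_score = npz["y_true"], npz["y_score"]

# Stand-in data so the snippet runs on its own:
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=100)
y_score = np.clip(y_true * 0.6 + rng.random(100) * 0.5, 0, 1)
y_pred = (y_score >= 0.5).astype(int)

print("accuracy:", accuracy_score(y_true, y_pred))
print("F1:      ", f1_score(y_true, y_pred))
print("MCC:     ", matthews_corrcoef(y_true, y_pred))
print("ROC-AUC: ", roc_auc_score(y_true, y_score))
```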
Results are organized by:
- Cross-validation fold (`cv1`-`cv5`)
- Prediction mode (`pair`, `frag`, etc.)
- Loss type (`diff`, `cat`, `product`)
## Contact

For questions or issues, please:
- Open an issue on GitHub
- Contact: