Dec 19, 2023

Graph Neural Network Framework for Web-Based Prediction of Protein-Ligand Docking Scores across multiple organs
  • 1. Department of Biotechnology, RV College of Engineering, Bangalore-560059, affiliated to Visvesvaraya Technological University (VTU), Belagavi-590018
  • 2. Department of Electronics & Telecommunication, RV College of Engineering, Bangalore-560059, affiliated to Visvesvaraya Technological University (VTU), Belagavi-590018
Open access
Protocol Citation: Anagha S Setlur, Vidya Niranjan, Arjun Balaji, Chandrashekar K 2023. Graph Neural Network Framework for Web-Based Prediction of Protein-Ligand Docking Scores across multiple organs. protocols.io https://dx.doi.org/10.17504/protocols.io.j8nlkoy9xv5r/v1
License: This is an open access protocol distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Protocol status: Working
We use this protocol and it's working
Created: December 15, 2023
Last Modified: December 19, 2023
Protocol Integer ID: 92373
Keywords: QSAR, machine learning and deep learning, graph convolution networks, graph neural networks, data pre-processing, human organs, web-based predictions
Abstract
Estimating the docking score between proteins and drugs is central to structure-based drug design. This project explores the application of graph neural networks (GNNs) to molecular property prediction from SMILES representations; the trained models are then deployed on a web-based platform for broader accessibility and use. The primary dataset includes molecules identified by MolPort IDs and their associated docking scores, which are critical in assessing molecular interactions. A significant aspect of this project is data preprocessing, in which each molecule, initially represented as a SMILES string, is converted into a graph. Effective molecular representation learning is pivotal for molecular property prediction. The models are evaluated on several performance metrics and deployed on the web-based platform.
Guidelines
QSAR modeling should first be performed for each protein under each organ. The analytical dataset should preferably contain columns for the ligand IDs and the SMILES structures.
Safety warnings
Attention
NA
Ethics statement
None.
Before start
Check that the system can run the data pre-processing and train the GCN and hybrid GCN models.
IMPORTING LIBRARIES
Import all necessary libraries

Install and import all libraries needed for data preprocessing, model training, and evaluation. A screenshot of the required imports is provided below.

Importing required libraries

DATASET CREATION
Here, Quantitative Structure-Activity Relationship (QSAR) data generated with Schrödinger Maestro was used for dataset creation. QSAR models were first generated for specific proteins using a set of ligands from MolPort.

Take the brain protein O14672 as an example. Here, Y(Obs) is the docking score. This dataset contains the MolPort IDs and the docking scores obtained from the QSAR modeling data.

image.png


Creation of analytical dataset
An analytical dataset was created using a second dataset that contains the MolPort IDs and their SMILES strings.

image.png


The following is performed to prepare an analytical dataset:

image.png


The processed dataset looks as follows:

image.png
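
The join described in this section can be sketched as follows. This is a minimal illustration using only the Python standard library, not the authors' code: the column names `MolPort_ID`, `SMILES`, and `docking_score`, as well as all sample IDs and values, are hypothetical. Only the `Y(Obs)` column name comes from the protocol.

```python
import csv
import io

# Hypothetical stand-ins for the two input files (IDs and values are made up).
scores_csv = """MolPort_ID,Y(Obs)
MolPort-000-000-001,-7.2
MolPort-000-000-002,-5.9
MolPort-000-000-003,-6.4
"""
smiles_csv = """MolPort_ID,SMILES
MolPort-000-000-001,CCO
MolPort-000-000-002,c1ccccc1
MolPort-000-000-004,CC(=O)O
"""

def build_analytical_dataset(scores_text, smiles_text):
    """Inner-join docking scores and SMILES strings on the ligand ID."""
    smiles_by_id = {
        row["MolPort_ID"]: row["SMILES"]
        for row in csv.DictReader(io.StringIO(smiles_text))
    }
    merged = []
    for row in csv.DictReader(io.StringIO(scores_text)):
        ligand_id = row["MolPort_ID"]
        if ligand_id in smiles_by_id:  # drop ligands that have no structure
            merged.append({
                "MolPort_ID": ligand_id,
                "SMILES": smiles_by_id[ligand_id],
                "docking_score": float(row["Y(Obs)"]),
            })
    return merged

dataset = build_analytical_dataset(scores_csv, smiles_csv)
for record in dataset:
    print(record)
```

Ligands that appear in only one of the two files are dropped here; whether the original workflow does the same is an assumption.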


DATA PRE-PROCESSING
SMILES to graph conversion

Data preprocessing is a pivotal step in this model. Each molecule represented by a SMILES string is converted into a graph, with atoms as nodes and chemical bonds as edges. This graph representation is essential for the GNN to accurately interpret molecular structures.

Feature Representation

Atom Features: Each atom is represented by a one-hot encoded feature vector, indicating the atom type. The model considers four types of atoms (C, O, N, B), leading to a 4-dimensional feature vector for each atom.
Bond Features: Bonds are characterized by their type (single, double, triple, aromatic) and their inclusion in a ring structure. Each bond is represented by a 5-dimensional feature vector.

Feature | Dimensions
One-hot encoding of atom types (C, O, N, B) | 4
Edge features for bond types (single, double, triple, aromatic) | 4
Edge features for bond presence in a ring structure | 1
Atom features for atom presence in a ring structure | 1
Bond indices for atom connectivity | 2 per bond

Using RDKit library for feature representation

These features are computed with the RDKit library. The conversion function turns a SMILES string into a molecular graph, encoding atom types as one-hot vectors and representing bonds by their type and ring membership.

Using RDKit for feature representation
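
As an illustration of the conversion described above, here is a minimal sketch using RDKit. It is not the authors' function: the name `smiles_to_graph`, the all-zero vector for atoms outside (C, O, N, B), and the choice to store every bond in both directions are assumptions. The sketch follows the 4-dimensional atom vector from the text and leaves out the atom ring-membership flag listed in the feature table.

```python
from rdkit import Chem

ATOM_TYPES = ["C", "O", "N", "B"]  # atom types listed in the protocol
BOND_TYPES = [
    Chem.rdchem.BondType.SINGLE,
    Chem.rdchem.BondType.DOUBLE,
    Chem.rdchem.BondType.TRIPLE,
    Chem.rdchem.BondType.AROMATIC,
]

def smiles_to_graph(smiles):
    """Return (node_features, edge_index, edge_features) as nested lists."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"Could not parse SMILES: {smiles!r}")

    # 4-dimensional one-hot atom vectors; other elements stay all-zero (assumption).
    node_features = [
        [1.0 if atom.GetSymbol() == symbol else 0.0 for symbol in ATOM_TYPES]
        for atom in mol.GetAtoms()
    ]

    edge_index, edge_features = [], []
    for bond in mol.GetBonds():
        i, j = bond.GetBeginAtomIdx(), bond.GetEndAtomIdx()
        # 4 bond-type entries plus 1 ring-membership flag = 5 dimensions.
        feat = [1.0 if bond.GetBondType() == t else 0.0 for t in BOND_TYPES]
        feat.append(1.0 if bond.IsInRing() else 0.0)
        # Store both directions so the graph is undirected (assumption).
        for src, dst in ((i, j), (j, i)):
            edge_index.append([src, dst])
            edge_features.append(feat)
    return node_features, edge_index, edge_features

nodes, edges, edge_attrs = smiles_to_graph("c1ccccc1O")  # phenol
print(len(nodes), "atoms,", len(edges), "directed edges,",
      len(edge_attrs[0]), "features per edge")
```

For phenol this gives 7 atoms and 14 directed edges (7 bonds stored twice), each edge carrying a 5-dimensional feature vector.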


MODEL TRAINING AND EVALUATION
Model defining and training

Define the models and train them with early stopping and appropriate hyperparameters.
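
The protocol does not state the early-stopping criterion, so the following is only a sketch of one common variant: training stops once the validation loss has failed to improve for a fixed number of epochs. The class name, the `patience` and `min_delta` parameters, and the loss values are all hypothetical.

```python
class EarlyStopping:
    """Stop when the validation loss has not improved for `patience` epochs."""

    def __init__(self, patience=10, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best_loss = float("inf")
        self.best_epoch = -1
        self.bad_epochs = 0

    def step(self, epoch, val_loss):
        """Record one epoch; return True when training should stop."""
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.best_epoch = epoch
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

# Made-up validation losses standing in for a real training loop.
val_losses = [1.00, 0.80, 0.70, 0.72, 0.71, 0.73, 0.74]
stopper = EarlyStopping(patience=3)
for epoch, loss in enumerate(val_losses):
    if stopper.step(epoch, loss):
        break
print("stopped at epoch", epoch, "best epoch", stopper.best_epoch)
```

In a real training loop the model weights from the best epoch would typically be kept, so that the saved model corresponds to `best_epoch` rather than to the last epoch run.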
MODEL 1- GRAPH CONVOLUTION NETWORK (GCN)

The first model we explore is a Graph Convolution Network (GCN) with 2 convolution layers.

image.png
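
To make the idea of a two-layer GCN concrete, here is a NumPy sketch of a forward pass. It assumes the widely used propagation rule H' = ReLU(Â H W) with a symmetrically normalized adjacency matrix Â, mean pooling over atoms, and a linear readout; the protocol does not specify these details, and the hidden size, random weights, and function names are hypothetical.

```python
import numpy as np

def normalized_adjacency(edge_index, num_nodes):
    """Return D^-1/2 (A + I) D^-1/2 for an undirected graph."""
    adj = np.eye(num_nodes)  # self-loops
    for src, dst in edge_index:
        adj[src, dst] = 1.0
        adj[dst, src] = 1.0
    inv_sqrt_deg = 1.0 / np.sqrt(adj.sum(axis=1))
    return adj * inv_sqrt_deg[:, None] * inv_sqrt_deg[None, :]

def gcn_forward(x, a_hat, params):
    """Two graph-convolution layers, mean pooling, and a linear readout."""
    h = np.maximum(a_hat @ x @ params["w1"], 0.0)  # layer 1 + ReLU
    h = np.maximum(a_hat @ h @ params["w2"], 0.0)  # layer 2 + ReLU
    graph_embedding = h.mean(axis=0)               # one vector per molecule
    return float(graph_embedding @ params["w_out"] + params["b_out"])

rng = np.random.default_rng(0)
hidden = 8  # hypothetical hidden size
params = {
    "w1": rng.normal(scale=0.5, size=(4, hidden)),
    "w2": rng.normal(scale=0.5, size=(hidden, hidden)),
    "w_out": rng.normal(scale=0.5, size=hidden),
    "b_out": 0.0,
}

# Ethanol (CCO): atoms C, C, O with one-hot features over (C, O, N, B).
x = np.array([[1, 0, 0, 0], [1, 0, 0, 0], [0, 1, 0, 0]], dtype=float)
a_hat = normalized_adjacency([(0, 1), (1, 2)], num_nodes=3)
score = gcn_forward(x, a_hat, params)
print("untrained prediction:", score)
```

The weights here are random, so the output is meaningless as a docking score. The sketch only shows how atom features flow through two convolutions into a single value per molecule; because of the mean pooling, the result does not depend on the order in which atoms are listed.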


MODEL 2- HYBRID GCN

The second model we explore is a hybrid GCN model:

image.png
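
The conclusion of this protocol describes the hybrid model as a GCN combined with an attention mechanism that employs positional encoding. The exact architecture is not given, so the sketch below only illustrates those two ingredients in NumPy: Transformer-style sinusoidal positional encoding added to node embeddings, followed by single-head scaled dot-product self-attention. The function names, dimensions, and random inputs are hypothetical.

```python
import numpy as np

def sinusoidal_positions(num_nodes, dim):
    """Sinusoidal positional encoding indexed by atom order (dim must be even)."""
    positions = np.arange(num_nodes)[:, None]
    freqs = np.exp(-np.log(10000.0) * (np.arange(0, dim, 2) / dim))
    enc = np.zeros((num_nodes, dim))
    enc[:, 0::2] = np.sin(positions * freqs)
    enc[:, 1::2] = np.cos(positions * freqs)
    return enc

def self_attention(h, w_q, w_k, w_v):
    """Single-head scaled dot-product attention over all atoms of a molecule."""
    q, k, v = h @ w_q, h @ w_k, h @ w_v
    logits = q @ k.T / np.sqrt(k.shape[1])
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ v, weights

rng = np.random.default_rng(0)
num_atoms, dim = 3, 8
h = rng.normal(size=(num_atoms, dim))  # stand-in for GCN node embeddings
h_pos = h + sinusoidal_positions(num_atoms, dim)
w_q, w_k, w_v = (rng.normal(scale=0.3, size=(dim, dim)) for _ in range(3))
attended, weights = self_attention(h_pos, w_q, w_k, w_v)
print(attended.shape, weights.sum(axis=1))
```

Each row of `weights` sums to 1 and states how strongly one atom attends to every other atom. In a hybrid model of this kind, `attended` would then be pooled and passed to a readout layer in the same way as plain GCN embeddings.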


5-fold cross validation

The models were trained with 5-fold cross-validation to improve robustness and reliability. This method gives a comprehensive evaluation by systematically partitioning the data into distinct subsets for training and validation.

image.png
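
The partitioning behind 5-fold cross-validation can be sketched in plain Python as follows. The helper name `k_fold_indices`, the seed, and the sample count are hypothetical; the original workflow may rely on a library routine instead.

```python
import random

def k_fold_indices(num_samples, k=5, seed=42):
    """Yield (train_indices, val_indices) for each of k shuffled folds."""
    indices = list(range(num_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k nearly equal parts
    for i in range(k):
        val = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, val

splits = list(k_fold_indices(num_samples=23, k=5))
for fold, (train, val) in enumerate(splits, start=1):
    print(f"fold {fold}: {len(train)} train / {len(val)} validation")
```

Every molecule is used for validation exactly once and for training in the other four folds, so the reported metrics can be averaged over five independent validation sets.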


The model's performance was further evaluated using Root Mean Squared Error (RMSE) and Mean Absolute Error (MAE), which quantify its predictive accuracy.
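
For reference, both metrics can be computed as in the sketch below. The observed and predicted docking scores are made-up values used only to show the arithmetic.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error: penalizes large errors more strongly."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))

def mae(y_true, y_pred):
    """Mean absolute error: average magnitude of the errors."""
    absolute = [abs(t - p) for t, p in zip(y_true, y_pred)]
    return sum(absolute) / len(absolute)

observed = [-7.0, -6.0, -5.0, -8.0]   # hypothetical Y(Obs) values
predicted = [-6.5, -6.0, -6.0, -7.5]  # hypothetical model outputs
print(f"RMSE = {rmse(observed, predicted):.3f}, MAE = {mae(observed, predicted):.3f}")
# prints: RMSE = 0.612, MAE = 0.500
```

RMSE is never smaller than MAE on the same data; a large gap between the two indicates that a few molecules are predicted much worse than the rest.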

PICKING THE BEST MODEL AND UPLOADING IN REPOSITORY
The best-performing model was selected and its weights were saved. These weights were then uploaded to the Streamlit repository.

image.png
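
A minimal sketch of the selection and saving step is shown below. It assumes that the model with the lowest validation RMSE is considered best, which the protocol does not state explicitly, and it stores placeholder weights with NumPy. The model names, RMSE values, weight arrays, and file name are hypothetical; a deep learning framework would normally use its own checkpoint format.

```python
import os
import tempfile

import numpy as np

# Hypothetical cross-validation summary for one protein (values are made up).
results = {
    "gcn": {
        "rmse": 0.82,
        "weights": {"w1": np.ones((4, 8)), "w_out": np.zeros(8)},
    },
    "hybrid_gcn": {
        "rmse": 0.74,
        "weights": {"w1": np.full((4, 8), 0.5), "w_out": np.ones(8)},
    },
}

# Assumption: the model with the lowest validation RMSE is the best one.
best_name = min(results, key=lambda name: results[name]["rmse"])

# Save the selected weights to a file that can be uploaded to the repository.
out_dir = tempfile.mkdtemp()
weights_path = os.path.join(out_dir, f"{best_name}_O14672.npz")
np.savez(weights_path, **results[best_name]["weights"])

# Reload to confirm the file is readable, as the web application would need to.
restored = np.load(weights_path)
print(best_name, sorted(restored.files))
```

Naming the file after both the model and the protein keeps the weights distinguishable once models for several organs and proteins are collected in one repository.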



These same steps were repeated across different proteins, datasets and models to integrate all models from each human organ into a single platform.
CONCLUSION
This protocol outlines the steps required to integrate the predicted QSAR data from each organ into a single platform covering all human organs and their associated proteins. Users provide a SMILES structure and obtain a predicted docking score from the integrated models. Data pre-processing is the key step in this protocol: an analytical dataset is created and each molecule is converted into a graph. The graph convolution network (GCN), an advanced deep learning technique, is shown as model 1; it maps high-dimensional data to a low-dimensional representation and relates the graphs to the target variable (here, the docking score). The hybrid model, shown as model 2, adds an attention mechanism that employs positional encoding alongside the traditional GCN. The web application lets users choose which model to use for their prediction. This protocol allows direct prediction of the binding affinity of small molecules to important proteins in human organs, thereby providing overall safety information on the small molecules.
ACKNOWLEDGEMENTS
The authors thank Mr. Akshay Uttarkar for providing inputs throughout.